In this paper we develop a nonparametric regression method that
is simultaneously adaptive over a wide range of function classes for 
the regression function and robust over a large collection of error dis- 
tributions, including those that are heavy-tailed, and may not even 
possess variances or means. Our approach is to first use local medi- 
ans to turn the problem of nonparametric regression with unknown 
noise distribution into a standard Gaussian regression problem and 
then apply a wavelet block thresholding procedure to construct an 
estimator of the regression function. It is shown that the estimator 
simultaneously attains the optimal rate of convergence over a wide 
range of the Besov classes, without prior knowledge of the smoothness 
of the underlying functions or prior knowledge of the error distribu- 
tion. The estimator also automatically adapts to the local smoothness 
of the underlying function, and attains the local adaptive minimax 
rate for estimating functions at a point. 

A key technical result in our development is a quantile coupling 
theorem which gives a tight bound for the quantile coupling between 
the sample medians and a normal variable. This median coupling 
inequality may be of independent interest. 

1. Introduction. A standard nonparametric regression model involves 
observation of $\{x_i, Y_i\}$ where

(1) $Y_i = f(x_i) + \xi_i, \quad i = 1, \ldots, n.$

Most of the theory that has so far been developed for such a model involves 
an assumption that the errors are independent and identically-distributed 
(i.i.d.) normal variables. These assumptions are suitable for a wide range of
applications of the model. In the Gaussian noise setting many smoothing 
techniques including wavelet thresholding methods have been developed and 
shown to be highly adaptive. However, when the noise $\xi_i$ has a heavy-tailed
distribution, these techniques are not readily applicable. For example, in
Cauchy regression where $\xi_i$ has a Cauchy distribution, typical realizations
of $\xi_i$ contain a few extremely large observations of order $n$ since

$P(\max\{\xi_i\} \ge n) = 1 - \left(\frac{1}{\pi}\arctan(n) + \frac{1}{2}\right)^n \to 1 - \exp\left(-\frac{1}{\pi}\right).$

In contrast, the largest observation of the noise $\xi_i$ in Gaussian regression is
of order $\sqrt{\log n}$. It is thus clear that the classical denoising methods designed
for Gaussian noise would fail if they are applied directly to the sample
$\{Y_i\}$ when the noise in fact has a Cauchy distribution. Standard wavelet
thresholding procedures would also fail in such a heavy-tailed noise setting. 
See Section 3.2 for further discussions. 
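As an illustrative aside, not part of the original development, the Cauchy tail computation can be checked numerically. The Python sketch below uses only the closed-form Cauchy and Gaussian distribution functions; the function names are hypothetical and chosen for this illustration.

```python
import math

def prob_max_cauchy_exceeds(n):
    """P(max of n i.i.d. standard Cauchy draws > n) = 1 - F(n)^n."""
    cdf_at_n = math.atan(n) / math.pi + 0.5
    return 1.0 - cdf_at_n ** n

def prob_max_gaussian_exceeds(n, level):
    """P(max of n i.i.d. N(0,1) draws > level) = 1 - Phi(level)^n."""
    phi = 0.5 * (1.0 + math.erf(level / math.sqrt(2.0)))
    return 1.0 - phi ** n

# Limiting value 1 - exp(-1/pi) of the Cauchy probability as n grows.
limit = 1.0 - math.exp(-1.0 / math.pi)

for n in (10, 100, 10_000):
    print(n, prob_max_cauchy_exceeds(n), limit, prob_max_gaussian_exceeds(n, n))
```

For the sample sizes shown, the Cauchy probability stays bounded away from zero, while the corresponding Gaussian probability is numerically negligible.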

In the usual nonparametric regression case the regression function $f$ is
often alternatively described as the conditional expectation $f(x_i) = E(Y_i \mid x_i)$.
However, if the error distributions fail to have a mean, then this conditional 
expectation will not exist. Even when the conditional expectation exists, 
estimating the conditional expectation may be a very non-robust goal, and 
not suitable for particular applications. For error distributions that may be 
heavy tailed it seems more suitable to estimate the conditional median of 
$Y_i$. Hence, in the sequel we assume (1) holds with

(2) $\xi_i$ i.i.d. and $\mathrm{median}(\xi_i) = 0$.

There are practical situations for which the normality assumption is not 
satisfactory. See, for example, Stuck and Kleiner (1974), Stuck (2000) and 
references therein. It is necessary to develop methods to be used in such 
cases, and to establish the theoretical properties of these methods. In this 
paper we develop an estimation method that is simultaneously adaptive over 
a wide range of function classes for $f$ and robust over a large collection of
error distributions for $\xi_i$, including those that are heavy-tailed, and may not
even possess variances or means. In brief, our method may be summarized 
as a blockwise wavelet thresholding implementation built from the medians 
of suitably binned data. We first divide the interval [0, 1] into a number of 
equal-length subintervals, then take the median of the observations in each
subinterval, and finally apply the BlockJS wavelet thresholding procedure 
developed in Cai (1999) to the local medians together with a bias correction 
to obtain an estimator of the regression function $f$.

Unlike most wavelet methods, the performance of the algorithm here is 
not sensitive to the tail behavior of the distribution of $\xi_i$, and hence can be
shown to have the necessary robustness property. We show that the estimator
enjoys a high degree of adaptivity and robustness. It is shown that the
estimator simultaneously attains the exact optimal rate of convergence over 
a wide range of the Besov classes, without prior knowledge of the smoothness 
of the underlying function or prior knowledge of the error distribution. The 
estimator also automatically adapts to the local smoothness of the under- 
lying function, and attains the local adaptive minimax rate for estimating 
functions at a point. 

Donoho and Yu (2000) considered this model for $\alpha$-stable noise, but the
risk properties of their proposal are unclear. In the wavelet regression setting, 
Hall and Patil (1996) studied nonparametric location models and achieved 
the optimal minimax rate up to a logarithmic term, but under an assumption
that $\xi_i$ has a finite fourth moment. As we noted, our results do not need
the existence of the mean for the noise or prior knowledge of the error dis- 
tribution. Most closely related to our work is Averkamp and Houdre (2003, 
2005) where the optimal minimax rate of global risk is studied. But their 
noise is assumed to be known, and their results are not adaptive. 

The key technical result in our development is a quantile coupling theorem 
that is used to connect our problem with a more familiar Gaussian setting. 
The theorem gives a tight bound for the quantile coupling between the 
medians of i.i.d. random variables and a normal variable. The result enables 
us to treat the medians of the observations in the subintervals as if they 
were normal random variables. The coupling theorem may be of independent 
interest, since analogous coupling theorems for means have proved to be an 
important general tool in many contexts. See Section 2 for this result and 
for further discussion and citations to the literature on quantile coupling. 

The paper is organized as follows. In Section 2 we derive a quantile cou- 
pling inequality for medians and obtain a moderate large deviation result. 
This coupling inequality is needed for the proof of the asymptotic proper- 
ties of our estimation procedure, and may be of independent interest for 
other statistical applications. Our procedure is defined in Section 3.2 and its 
asymptotic properties are described in Section 4. Section 5 contains further 
discussion of our results, and formal proofs are contained in Section 6. The 
reader interested only in the definition of our wavelet regression procedure 
and a description of its properties can skip Section 2 and proceed directly 
to Section 3. 

2. Quantile coupling for median. We begin with a brief introduction to 
quantile coupling. Let X be a random variable with distribution G and Y 
with a continuous distribution F. Define 



(3) $\tilde{X} = G^{-1}(F(Y)),$

where $G^{-1}(x) = \inf\{u : G(u) \ge x\}$, then $\mathcal{L}(\tilde{X}) = \mathcal{L}(X)$ [cf. Pollard (2001),
page 41]. Note that $\tilde{X}$ and $Y$ are now defined on the same probability
space. This makes it possible to give a pointwise bound between $\tilde{X}$ and $Y$.
The first tight bound of quantile coupling between the sum of i.i.d. random 
variables with a normal random variable was given in Komlós, Major and
Tusnády (1975). A bound for the coupling of a Binomial random variable
with a normal random variable is given as follows. For $X \sim \mathrm{Binomial}(n, 1/2)$
and $Y \sim N(n/2, n/4)$, let $\tilde{X}(Y)$ be defined as in equation (3). Then for some
constants $C > 0$ and $\varepsilon > 0$, when $|\tilde{X} - n/2| \le \varepsilon n$,

(4) $|\tilde{X} - Y| \le C + \frac{C}{n}\,|\tilde{X} - n/2|^2.$

This result plays a key role in the KMT/Hungarian construction to couple 
the empirical distribution with a Brownian bridge. A detailed proof of the 
result can be found in Mason (2001) and Bretagnolle and Massart (1989). 
A general theory for improving the classical quantile coupling bound was 
given in Zhou (2005). 
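To make the construction in (3) concrete, the following Python sketch couples a Binomial(n, 1/2) variable to a normal variable with mean n/2 and variance n/4 through the quantile transform. It is only an illustration: the helper names are hypothetical, and the binomial quantile is computed by brute-force summation, which is adequate for moderate n.

```python
import math
from statistics import NormalDist

def binomial_quantile(u, n, p=0.5):
    """Generalized inverse G^{-1}(u) = inf{x : G(x) >= u} for Binomial(n, p)."""
    cumulative = 0.0
    for x in range(n + 1):
        cumulative += math.comb(n, x) * p**x * (1 - p)**(n - x)
        if cumulative >= u:
            return x
    return n

def coupled_binomial(y, n):
    """Quantile coupling X~ = G^{-1}(F(y)), with F the N(n/2, n/4) distribution function."""
    normal = NormalDist(mu=n / 2, sigma=math.sqrt(n) / 2)
    return binomial_quantile(normal.cdf(y), n)

n = 400
for y in (180.0, 200.0, 215.0, 240.0):
    x = coupled_binomial(y, n)
    # Coupling error next to the quadratic term (x - n/2)^2 / n appearing in the bound.
    print(y, x, abs(x - y), (x - n / 2) ** 2 / n)
```

Since both variables are functions of the same value y, the difference between them can be examined pointwise, which is what the coupling bound controls.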

Standard coupling inequalities are mostly focused on the coupling of the 
mean of i.i.d. random variables with a normal variable. In this section we 
study the coupling of a median statistic with a normal variable. We derive 
a moderate deviation result for the median statistic and obtain a quantile 
coupling inequality similar to the classical KMT bound for the mean. This 
coupling result plays a crucial role in this paper. It is the main tool for 
reducing the problem of robust estimation with unknown noise to a well 
studied problem of Gaussian regression with unknown variance. The result 
here may be of independent interest because of the fundamental role played 
by the median in statistics. 

Let $X_1, \ldots, X_n$ be i.i.d. random variables with density function $h$. Denote
the sample median by $X_{\mathrm{med}}$. We will construct a new random variable $\tilde{X}_{\mathrm{med}}$
by using quantile coupling in (3) such that $\mathcal{L}(\tilde{X}_{\mathrm{med}}) = \mathcal{L}(X_{\mathrm{med}})$ and show
that $\tilde{X}_{\mathrm{med}}$ can be well approximated by a normal random variable as in
equation (4). We need the following assumptions on the density function $h(x)$ to
derive the quantile coupling inequality.

Assumption (A1). $\int_{-\infty}^{0} h(x)\,dx = \frac{1}{2}$, $h(0) > 0$, and $h(x)$ is Lipschitz at
$x = 0$.

Here the Lipschitz condition at 0 means that there is a constant $C > 0$
such that $|h(x) - h(0)| \le C|x|$ in an open neighborhood of 0. This condition
implies that $h$ is continuous at 0. We assume $h(0) > 0$ so that the median
of the distribution is unique and the distribution of the sample median 
is asymptotically normal [cf. Casella and Berger (2002), page 483]. The 
Lipschitz condition is assumed so that a moderate large deviation result
for the distribution of sample median can be obtained to derive a quantile 
coupling inequality as in equation (4). 

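Theorem 1. Let $Z$ be a standard normal random variable and let $X_1, \ldots, X_n$
be i.i.d. with density function $h$ where $n = 2k + 1$ for some integer
$k \ge 1$. Let Assumption (A1) hold. Then for every $n$ there is a mapping
$\tilde{X}_{\mathrm{med}}(Z): \mathbb{R} \to \mathbb{R}$ such that $\mathcal{L}(\tilde{X}_{\mathrm{med}}(Z)) = \mathcal{L}(X_{\mathrm{med}})$ and

(5) $|\sqrt{4n}\,h(0)\tilde{X}_{\mathrm{med}} - Z| \le \frac{C}{\sqrt{n}} + \frac{C}{\sqrt{n}}\,|\sqrt{4n}\,h(0)\tilde{X}_{\mathrm{med}}|^2$ when $|\tilde{X}_{\mathrm{med}}| \le \varepsilon$,

where $C, \varepsilon > 0$ depend on $h$ but not on $n$.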

The quantile coupling bound here is similar to the classical KMT bound
(4) for the sample mean. This result is closely connected to the strong approximation
of the quantile process in Csörgő and Révész (1978). The condition of
Theorem 1 here is weaker. Only a Lipschitz condition at $x = 0$ is assumed
here to establish the non-uniform bound given in (5). As shown in Zhou
(2005), the classical quantile coupling bound for the mean can be improved
when the distribution of $X_i$ is symmetric. Similarly, if we assume $h'(0) = 0$,
the bound in Theorem 1 can be improved from the rate $1/\sqrt{n}$ to the rate
$1/n$. See Section 5 for more details. The bound in Theorem 1 can also be
expressed in terms of $Z$, as follows.

Corollary 1. Under the assumption of Theorem 1, the mapping $\tilde{X}_{\mathrm{med}}(Z)$
in Theorem 1 satisfies

(6) $|\sqrt{4n}\,h(0)\tilde{X}_{\mathrm{med}} - Z| \le \frac{C}{\sqrt{n}}(1 + |Z|^2)$ when $|Z| \le \varepsilon\sqrt{n}$,

where $C, \varepsilon > 0$ do not depend on $n$.
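The median coupling can be explored numerically in a simple case. The sketch below is an illustration under stated assumptions rather than part of the proof: it takes standard Cauchy noise, for which h(0) = 1/π, uses the fact that the median of n = 2k + 1 observations is at most x exactly when at least k + 1 observations are at most x, and inverts the resulting distribution function by bisection. All function names and the bracketing interval are hypothetical choices.

```python
import math
from statistics import NormalDist

def median_cdf(x, k, cdf):
    """Distribution function of the sample median of n = 2k + 1 i.i.d. draws:
    the median is <= x exactly when at least k + 1 draws are <= x."""
    n = 2 * k + 1
    p = cdf(x)
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1, n + 1))

def coupled_median(z, k, cdf, lo=-50.0, hi=50.0):
    """Quantile coupling of the median with a standard normal value z,
    obtained by bisection on the median distribution function."""
    target = NormalDist().cdf(z)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if median_cdf(mid, k, cdf) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def cauchy_cdf(x):
    return 0.5 + math.atan(x) / math.pi

k = 50
n = 2 * k + 1
h0 = 1.0 / math.pi  # standard Cauchy density at 0
for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    x = coupled_median(z, k, cauchy_cdf)
    # Coupling error sqrt(4n) h(0) X~_med - Z on the left side of the inequality.
    print(z, math.sqrt(4 * n) * h0 * x - z)
```

The printed differences can be compared with the order of the bound in Corollary 1 as k is increased.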

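Remark 1. When $n = 2k$ is even, the sample median $X_{\mathrm{med}}$ is usually
taken to be $(X_{(k)} + X_{(k+1)})/2$. Quantile coupling inequalities similar to (5)
and (6) can be obtained. For each $i$, let $X_{-i,\mathrm{med}}$ be the median of the original
sample with $X_i$ removed. Then $X_{\mathrm{med}} = \frac{1}{n}\sum_{i=1}^{n} X_{-i,\mathrm{med}}$. Let $G_{n-1}$ be the
distribution of the median of $n - 1$ i.i.d. observations with density $h$ and define
$(Z_i)_{1 \le i \le n} \sim \mathcal{L}(\Phi^{-1} \circ G_{n-1}(X_{-i,\mathrm{med}}), 1 \le i \le n)$. Let $\tilde{X}_{-i,\mathrm{med}} = G_{n-1}^{-1}(\Phi(Z_i))$.
Then $\mathcal{L}(\tilde{X}_{-i,\mathrm{med}}, 1 \le i \le n) = \mathcal{L}(X_{-i,\mathrm{med}}, 1 \le i \le n)$. Now a direct application
of Theorem 1 gives

$|\sqrt{4n}\,h(0)\tilde{X}_{\mathrm{med}} - Z| \le \frac{C}{\sqrt{n}}\left(1 + |\sqrt{4n}\,h(0)(|\tilde{X}_{(k)}| + |\tilde{X}_{(k+1)}|)|^2\right)$

when $|\tilde{X}_{(k)}| + |\tilde{X}_{(k+1)}| \le \varepsilon$, and $Z = \frac{1}{n}\sum_{i=1}^{n} Z_i$. So in Sections 3 and 5 we
assume the number of observations in each bin is odd without loss of generality.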
The coupling result given in Theorem 1 in fact holds uniformly over a rich
collection of distributions. For $0 < \varepsilon_1 < 1$ and $\varepsilon_2 > 0$ define

(7) $\mathcal{H}_{\varepsilon_1,\varepsilon_2} = \left\{h : \int_{-\infty}^{0} h(x)\,dx = \frac{1}{2},\ \varepsilon_1 \le h(0) \le \frac{1}{\varepsilon_1},\ |h(x) - h(0)| \le \frac{|x|}{\varepsilon_1} \text{ for all } |x| < \varepsilon_2\right\}.$

It can be shown that Theorem 1 holds uniformly for the whole family of
$h \in \mathcal{H}_{\varepsilon_1,\varepsilon_2}$.

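Theorem 2. Let $X_1, \ldots, X_n$ be i.i.d. with density $h \in \mathcal{H}_{\varepsilon_1,\varepsilon_2}$. For every
$n = 2k + 1$ with integer $k \ge 1$, there is a mapping $\tilde{X}_{\mathrm{med}}(Z): \mathbb{R} \to \mathbb{R}$ such that
$\mathcal{L}(\tilde{X}_{\mathrm{med}}(Z)) = \mathcal{L}(X_{\mathrm{med}})$ and for two constants $C_{\varepsilon_1,\varepsilon_2}, \varepsilon_{\varepsilon_1,\varepsilon_2} > 0$ depending
only on $\varepsilon_1$ and $\varepsilon_2$

(8) $|\sqrt{4n}\,h(0)\tilde{X}_{\mathrm{med}} - Z| \le \frac{C_{\varepsilon_1,\varepsilon_2}}{\sqrt{n}} + \frac{C_{\varepsilon_1,\varepsilon_2}}{\sqrt{n}}\,|\sqrt{4n}\,h(0)\tilde{X}_{\mathrm{med}}|^2$ when $|\tilde{X}_{\mathrm{med}}| \le \varepsilon_{\varepsilon_1,\varepsilon_2}$,

uniformly over all $h \in \mathcal{H}_{\varepsilon_1,\varepsilon_2}$.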

Remark 2. The quantile coupling inequalities in Corollary 1 and Remark 1
also hold uniformly over $\mathcal{H}_{\varepsilon_1,\varepsilon_2}$ by replacing $C$ and $\varepsilon$ there with two
constants depending on $\varepsilon_1$ and $\varepsilon_2$.



3. Methodology for robust wavelet regression. We now define our robust 
nonparametric regression estimator. Then we apply the median quantile 
coupling results developed in the previous section to establish its asymptotic 
properties. 

As we have mentioned, the first key step in our approach is to bin the data 
according to the values of the independent variable. The sample median is 
then computed within each bin. This leads to a new data situation in which 
the bin centers are treated as the independent variables in a nonparametric 
regression, with the bin medians being the dependent variables. This new 
situation can then be satisfactorily viewed as if it were a Gaussian regression 
problem. It is important that the number of bins be chosen in a suitable 
range. For the applications in our paper it turns out to be appropriate to 
choose the number of bins to be $T \asymp n^{3/4}$, where $n$ is the original sample
size. It appears that such a choice of T would also be suitable for use with 
many other Gaussian nonparametric regression methods. 

Proceeding in this way one should expect as a heuristic principle that 
the resulting nonparametric procedure will inherit the asymptotic optimal- 
ity properties of the Gaussian nonparametric regression technique that is 
employed. Of course, this heuristic principle needs to be established in particular
cases. The difficulty of doing so will depend on the nature of the
Gaussian technique and the generality of the asymptotic assumptions. 

In the present treatment we choose to employ a Gaussian wavelet method 
involving a block James-Stein wavelet estimator. Implementation of the pro- 
cedure is straightforward since the number of bins can be chosen as a power 
of 2, as is especially convenient for wavelet implementation. This estimator 
enjoys excellent asymptotic adaptivity properties in the Gaussian setting. 
We show that the current binned-median version has analogous properties 
over nearly the same range of Besov balls as does the original Gaussian pro- 
cedure. The precise statement of asymptotic properties is contained in The- 
orems 3 and 4. The full strength of the asymptotic properties of our wavelet 
procedure in a Gaussian setting depends on detailed moderate-deviation 
properties of the Gaussian distribution. For this reason our proof of asymp- 
totic properties of the binned median version requires careful treatment of 
moderate-deviation properties of the binned medians, as in the coupling 
results established in Section 2. 

We shall focus on the case where the design points $\{x_i\}$ are equally
spaced on the interval [0, 1]. The more general case will be discussed at the
end of Section 4. The procedure, which will be described in detail in the
next section, can be briefly summarized as follows. Let the sample
$\{Y_i, i = 1, \ldots, n\}$ be given as in (1) where $x_i = i/n$ and the noise variables $\xi_i$ are
i.i.d. with an unknown density $h$. Let $J = \lfloor \log_2 n^{3/4} \rfloor$. Set $T = 2^J$ and $m = n/T$.
We divide the interval [0, 1] into $T$ equal-length subintervals. Note that
$T \asymp n^{3/4}$. For $1 \le j \le T$, let $I_j = \{Y_i : x_i \in (\frac{j-1}{T}, \frac{j}{T}]\}$ be the $j$th bin and let
$X_j$ be the median of the observations in $I_j$. We treat $X_j$ as if it were a
normal random variable with mean $f(j/T) + b_m$ and variance $1/(4mh^2(0))$
(see Theorem 1), where

(9) $b_m = E\{\mathrm{median}(\xi_1, \ldots, \xi_m)\}.$

We then apply a nonparametric Gaussian regression procedure. In this paper,
we apply the BlockJS wavelet thresholding procedure developed in Cai
(1999) to construct an estimator of $f$. The final estimator $\hat{f}_n$ is given in
equations (16) and (18).
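The binning step just summarized can be sketched in a few lines of Python. This is a minimal illustration, assuming that the observations are ordered by design point and that T divides n; the function name and the simulated test function are hypothetical and not taken from the paper.

```python
import math
import random
from statistics import median

def bin_medians(y):
    """Split y (observed at x_i = i/n, in order) into T = 2^J equal-length bins
    with J = floor(log2 n^{3/4}) and return the bin medians together with T and m."""
    n = len(y)
    J = math.floor(0.75 * math.log2(n))
    T = 2 ** J
    m = n // T  # assumes T divides n, as in the proofs
    return [median(y[j * m:(j + 1) * m]) for j in range(T)], T, m

# Simulated example: a smooth signal observed with standard Cauchy noise.
random.seed(0)
n = 4096
y = [math.sin(2 * math.pi * i / n) + math.tan(math.pi * (random.random() - 0.5))
     for i in range(1, n + 1)]
medians, T, m = bin_medians(y)
print(T, m, max(abs(v) for v in y), max(abs(v) for v in medians))
```

With n = 4096 this gives T = 512 bins of m = 8 observations each. Here m is even, so `statistics.median` averages the two middle values; the theory assumes an odd bin size without loss of generality.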

We begin in Section 3.1 with a brief introduction to wavelet block thresh- 
olding in the Gaussian regression setting and then give a detailed description 
of our wavelet procedure for robust estimation in Section 3.2. 

3.1. Wavelet block thresholding for Gaussian regression. Let $\{\phi, \psi\}$ be a
pair of father and mother wavelets. The functions $\phi$ and $\psi$ are assumed to
be compactly supported and $\int \phi = 1$. Dilation and translation of $\phi$ and $\psi$
generate an orthonormal wavelet basis. For simplicity in exposition, in the
present paper we work with periodized wavelet bases on [0, 1]. Let

$\phi_{j,k}^{p}(t) = \sum_{l=-\infty}^{\infty} \phi_{j,k}(t - l), \quad \psi_{j,k}^{p}(t) = \sum_{l=-\infty}^{\infty} \psi_{j,k}(t - l) \quad \text{for } t \in [0, 1],$

where $\phi_{j,k}(t) = 2^{j/2}\phi(2^j t - k)$ and $\psi_{j,k}(t) = 2^{j/2}\psi(2^j t - k)$. The collection
$\{\phi_{j_0,k}^{p}, k = 1, \ldots, 2^{j_0}; \psi_{j,k}^{p}, j \ge j_0 \ge 0, k = 1, \ldots, 2^j\}$ is then an orthonormal
basis of $L^2[0, 1]$, provided the primary resolution level $j_0$ is large enough
to ensure that the support of the scaling functions and wavelets at level $j_0$
is not the whole of [0, 1]. The superscript "p" will be suppressed from the
notation for convenience. An orthonormal wavelet basis has an associated
orthogonal Discrete Wavelet Transform (DWT) which transforms sampled
data into the wavelet coefficients. See Daubechies (1992) and Strang (1992)
for further details about the wavelets and discrete wavelet transform. A
square-integrable function $f$ on [0, 1] can be expanded into a wavelet series:

(10) $f(t) = \sum_{k=1}^{2^{j_0}} \tilde{\theta}_{j_0,k}\,\phi_{j_0,k}(t) + \sum_{j=j_0}^{\infty}\sum_{k=1}^{2^j} \theta_{j,k}\,\psi_{j,k}(t),$

where $\tilde{\theta}_{j,k} = \langle f, \phi_{j,k}\rangle$, $\theta_{j,k} = \langle f, \psi_{j,k}\rangle$ are the wavelet coefficients of $f$.

The BlockJS procedure was proposed in Cai (1999) for Gaussian nonpara- 
metric regression and was shown to achieve simultaneously three objectives: 
adaptivity, spatial adaptivity, and computational efficiency. The procedure 
can be most easily explained in the sequence space setting. Suppose we 
observe the wavelet sequence data: 

(11) $y_{j,k} = \theta_{j,k} + \sigma z_{j,k}, \quad j \ge j_0, \quad k = 1, 2, \ldots, 2^j,$

where $z_{j,k}$ are i.i.d. $N(0, 1)$ and the noise level $\sigma$ is known. The mean vector $\theta$
is the object of interest. The BlockJS procedure is as follows. Let $J = [\log_2 n]$.
Divide each resolution level $j_0 \le j < J$ into nonoverlapping blocks of length
$L = [\log n]$ (or $L = 2^{\lfloor \log_2(\log n) \rfloor} \sim \log n$). Let $B_j^i$ denote the set of indices of
the coefficients in the $i$-th block at level $j$, that is, $B_j^i = \{(j, k) : (i - 1)L + 1 \le k \le iL\}$.
Let $S_{(j,i)}^2 = \sum_{(j,k) \in B_j^i} y_{j,k}^2$ denote the sum of squared empirical wavelet
coefficients in block $B_j^i$. A James-Stein type shrinkage rule is then applied
to each block $B_j^i$. For $(j, k) \in B_j^i$,

(12) $\hat{\theta}_{j,k} = \begin{cases} \left(1 - \frac{\lambda_* L \sigma^2}{S_{(j,i)}^2}\right)_+ y_{j,k}, & \text{for } (j, k) \in B_j^i,\ j_0 \le j < J, \\ 0, & \text{for } j \ge J, \end{cases}$

where $\lambda_* = 4.50524$ is a constant satisfying $\lambda_* - \log \lambda_* = 3$. The threshold
$\lambda_* = 4.50524$ is selected according to a block thresholding oracle inequality
and a minimax criterion. See Cai (1999) for further details.
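The block shrinkage rule for a single resolution level can be written compactly. The Python sketch below is an illustration of the rule as described above, with a known noise level; the function name is hypothetical, and the handling of a shorter final block (using its actual length) is an assumption of this sketch rather than something specified in the text.

```python
import math

LAMBDA_STAR = 4.50524  # approximate root of lambda - log(lambda) = 3

def block_js_level(y, sigma, L):
    """Shrink one resolution level block by block: each block of length L is
    multiplied by (1 - lambda_* L sigma^2 / S^2)_+, where S^2 is the block energy."""
    out = []
    for start in range(0, len(y), L):
        block = y[start:start + L]
        s2 = sum(v * v for v in block)
        # Assumption: a short final block uses its own length in place of L.
        if s2 > 0:
            factor = max(0.0, 1.0 - LAMBDA_STAR * len(block) * sigma**2 / s2)
        else:
            factor = 0.0
        out.extend(factor * v for v in block)
    return out

coeffs = [0.1, -0.1, 0.05, 0.0, 10.0, 10.0, 10.0, 10.0]
print(block_js_level(coeffs, sigma=1.0, L=4))
print(LAMBDA_STAR - math.log(LAMBDA_STAR))
```

In this example the first block has energy far below the threshold and is set to zero, while the second block, which carries a strong signal, is shrunk only slightly.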
3.2. Wavelet procedure for robust regression. Now we are ready to give
a detailed description of our procedure for robust estimation. Hereafter we
shall set $g(t) = f(t) + b_m$ where $b_m$ is given as in (9).

Apply the discrete wavelet transform to the binned medians $X = (X_1, \ldots, X_T)$,
and let $U = T^{-1/2} W X$ be the empirical wavelet coefficients, where $W$ is the
discrete wavelet transformation matrix. Write

(13) $U = (\tilde{y}_{j_0,1}, \ldots, \tilde{y}_{j_0,2^{j_0}}, y_{j_0,1}, \ldots, y_{j_0,2^{j_0}}, \ldots, y_{J-1,1}, \ldots, y_{J-1,2^{J-1}})'.$

Here $\tilde{y}_{j_0,k}$ are the gross structure terms at the lowest resolution level,
and $y_{j,k}$ ($j = j_0, \ldots, J - 1$; $k = 1, \ldots, 2^j$) are empirical wavelet coefficients
at level $j$ which represent fine structure at scale $2^j$. The empirical wavelet
coefficients can be written as

(14) $y_{j,k} = \theta_{j,k} + \epsilon_{j,k} + \frac{1}{2h(0)\sqrt{n}}\, z_{j,k} + \xi_{j,k},$

where $\theta_{j,k}$ are the true wavelet coefficients of $g = f + b_m$, $\epsilon_{j,k}$ are "small"
deterministic approximation errors, $z_{j,k}$ are i.i.d. $N(0, 1)$, and $\xi_{j,k}$ are some
"small" stochastic errors. The theoretical calculations given in Section 6 will
show that both the approximation errors $\epsilon_{j,k}$ and the stochastic errors $\xi_{j,k}$
are negligible in a certain sense. If these negligible errors are ignored then we
have

(15) $y_{j,k} \approx \theta_{j,k} + \frac{1}{2h(0)\sqrt{n}}\, z_{j,k},$

which is the same as the idealized sequence model (11) with noise level
$\sigma = 1/(2h(0)\sqrt{n})$.
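For readers who wish to experiment, the transform of the binned medians into empirical wavelet coefficients can be sketched with the Haar basis. This is a simplifying assumption made only for illustration: the paper works with general compactly supported periodized wavelets, and the function names below are hypothetical. The sketch takes the coarsest possible primary resolution level, so that a single scaling coefficient is returned first.

```python
import math

def haar_dwt(x):
    """Orthonormal Haar transform of a list of length 2^J.
    Returns [scaling coefficient, coarsest detail, ..., finest details]."""
    approx = list(x)
    details = []
    while len(approx) > 1:
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        detail = [(a - b) / math.sqrt(2) for a, b in pairs]
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
        details = detail + details  # coarser levels are placed in front
    return approx + details

def empirical_coefficients(medians):
    """U = T^{-1/2} W X for the vector X of T binned medians (Haar W assumed)."""
    T = len(medians)
    return [c / math.sqrt(T) for c in haar_dwt(medians)]

print(haar_dwt([1.0, 2.0, 3.0, 4.0]))
print(empirical_coefficients([3.0] * 8))
```

Because the transform is orthogonal, the sum of squares of the input is preserved, and the scaling by $T^{-1/2}$ makes the leading coefficient of a constant sequence equal to that constant.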

The BlockJS procedure is then applied to the empirical coefficients $y_{j,k}$ as
if they are distributed as in (15). More specifically, at each resolution level $j$,
the empirical wavelet coefficients $y_{j,k}$ are grouped into nonoverlapping blocks
of length $L$. As in the sequence estimation setting let $B_j^i = \{(j, k) : (i - 1)L + 1 \le k \le iL\}$
and let $S_{(j,i)}^2 = \sum_{(j,k) \in B_j^i} y_{j,k}^2$. Let $\hat{h}^2(0)$ be an estimator of $h^2(0)$
[see equation (38) for an estimator]. A modified James-Stein shrinkage rule
is then applied to each block $B_j^i$, that is,

(16) $\hat{\theta}_{j,k} = \left(1 - \frac{\lambda_* L}{4\hat{h}^2(0)\, n\, S_{(j,i)}^2}\right)_+ y_{j,k} \quad \text{for } (j, k) \in B_j^i,$

where $\lambda_* = 4.50524$ is the solution to the equation $\lambda_* - \log \lambda_* = 3$ and
$4\hat{h}^2(0)n$ in the shrinkage factor of (16) is due to the fact that the noise
level in (15) is $\sigma = 1/(2h(0)\sqrt{n})$. For the gross structure terms at the lowest
resolution level $j_0$, we set $\hat{\tilde{\theta}}_{j_0,k} = \tilde{y}_{j_0,k}$. The estimate of $g$ at the equally
spaced sample points $\{\frac{i}{T} : i = 1, \ldots, T\}$ is then obtained by applying the
inverse discrete wavelet transform (IDWT) to the denoised wavelet coefficients.
That is, $\{g(\frac{i}{T}) : i = 1, \ldots, T\}$ is estimated by $\hat{g} = \{\hat{g}(\frac{i}{T}) : i = 1, \ldots, T\}$
with $\hat{g} = T^{1/2} W^{-1} \cdot \hat{\theta}$. The estimate of the whole function $g = f + b_m$ is given
by

$\hat{g}_n(t) = \sum_{k=1}^{2^{j_0}} \hat{\tilde{\theta}}_{j_0,k}\,\phi_{j_0,k}(t) + \sum_{j=j_0}^{J-1}\sum_{k=1}^{2^j} \hat{\theta}_{j,k}\,\psi_{j,k}(t).$

To get an estimator of $f$ we need to also estimate $b_m$. This is done as follows.
Divide each bin $I_j$ into two sub-bins with the first bin of the size $\lfloor m/2 \rfloor$. Let
$X_j^*$ be the median of observations in the first sub-bin. We set

(17) $\hat{b}_m = \frac{1}{T}\sum_{j} (X_j^* - X_j)$

and define

(18) $\hat{f}_n(t) = \hat{g}_n(t) - \hat{b}_m = \sum_{k=1}^{2^{j_0}} \hat{\tilde{\theta}}_{j_0,k}\,\phi_{j_0,k}(t) + \sum_{j=j_0}^{J-1}\sum_{k=1}^{2^j} \hat{\theta}_{j,k}\,\psi_{j,k}(t) - \hat{b}_m.$

Remark 3. The quantity $b_m$ is the systematic bias due to the expectation
of the median of the noise $\xi_i$ in each bin. Lemma 5 in Section 6 shows
that $b_m = -\frac{h'(0)}{8h^3(0)}\, m^{-1} + O(m^{-2})$. Hence this systematic bias can possibly
be dominant if it is ignored. The estimate $\hat{b}_m$ serves as "bias correction."
Lemma 5 shows that the estimation error of $\hat{b}_m$ is negligible relative to the
minimax risk of $\hat{f}_n$ when $m = O(n^{1/4})$.
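The bias correction can be illustrated with a short Python sketch. It assumes, as one reading of the construction above, that the first sub-bin of each bin consists of its first ⌊m/2⌋ observations, and it averages the difference between the sub-bin median and the full-bin median over the bins. The heuristic behind this choice is that the bias of a median of m/2 observations is roughly twice that of a median of m observations when the bias is of order 1/m. The function name and data layout are hypothetical.

```python
from statistics import median

def bias_estimate(bins):
    """Average over bins of (median of the first floor(m/2) observations)
    minus (median of the whole bin)."""
    total = 0.0
    for obs in bins:
        half = len(obs) // 2  # assumed size of the first sub-bin
        total += median(obs[:half]) - median(obs)
    return total / len(bins)

example_bins = [[1, 2, 3, 4, 5, 6], [0, 0, 0, 0, 0, 0]]
print(bias_estimate(example_bins))
```

The resulting value would then be subtracted from the estimate of g to obtain an estimate of f.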

4. Adaptivity and robustness of the procedure. We study the theoreti- 
cal properties of our procedure over the Besov spaces that are by now stan- 
dard for the analysis of wavelet regression methods. Besov spaces are a very 
rich class of function spaces and contain as special cases many traditional 
smoothness spaces such as Hölder and Sobolev spaces. Roughly speaking,
the Besov space $B_{p,q}^{\alpha}$ contains functions having $\alpha$ bounded derivatives in
$L^p$ norm; the third parameter $q$ gives a finer gradation of smoothness. Full
details of Besov spaces are given, for example, in Triebel (1992) and DeVore
and Popov (1988). For a given $r$-regular mother wavelet $\psi$ with $r > \alpha$ and
a fixed primary resolution level $j_0$, the Besov sequence norm $\|\cdot\|_{b_{p,q}^{\alpha}}$ of the
wavelet coefficients of a function $f$ is then defined by

(19) $\|f\|_{b_{p,q}^{\alpha}} = \|\xi_{j_0}\|_p + \left(\sum_{j=j_0}^{\infty} (2^{js}\|\theta_j\|_p)^q\right)^{1/q},$

where $\xi_{j_0}$ is the vector of the father wavelet coefficients at the primary
resolution level $j_0$, $\theta_j$ is the vector of the wavelet coefficients at level $j$, and
$s = \alpha + \frac{1}{2} - \frac{1}{p} > 0$. Note that the Besov function norm of index $(\alpha, p, q)$ of a
function $f$ is equivalent to the sequence norm (19) of the wavelet coefficients
of the function. See Meyer (1992). We define

(20) $B_{p,q}^{\alpha}(M) = \{f : \|f\|_{b_{p,q}^{\alpha}} \le M\}.$

In the case of Gaussian noise Donoho and Johnstone (1998) show that the
minimax risk of estimating $f$ over the Besov body $B_{p,q}^{\alpha}(M)$,

(21) $R^*(B_{p,q}^{\alpha}(M)) = \inf_{\hat{f}} \sup_{f \in B_{p,q}^{\alpha}(M)} E\|\hat{f} - f\|_2^2,$

converges to 0 at the rate of $n^{-2\alpha/(1+2\alpha)}$ as $n \to \infty$.

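In addition to Assumption (A1) in Section 2, we need the following weak
condition on the density $h$ of $\xi_i$.

Assumption (A2). $\int |x|^{\varepsilon_3} h(x)\,dx < \infty$ for some $\varepsilon_3 > 0$.

This assumption guarantees that the moments of the median of the binned
data are well approximated by those of the normal random variable. Note
that Assumption (A2) is satisfied by the Cauchy distribution for any $0 < \varepsilon_3 < 1$.
For $0 < \varepsilon_1 < 1$, $\varepsilon_i > 0$, $i = 2, 3, 4$, define $\mathcal{H} = \mathcal{H}(\varepsilon_1, \varepsilon_2, \varepsilon_3, \varepsilon_4)$ by

(22) $\mathcal{H} = \left\{h : h \in \mathcal{H}_{\varepsilon_1,\varepsilon_2},\ |h^{(3)}(x)| \le \varepsilon_4 \text{ for } |x| \le \varepsilon_3 \text{ and } \int |x|^{\varepsilon_3} h(x)\,dx \le \varepsilon_4\right\}.$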

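The following theorem shows that our estimator achieves optimal global
adaptation for a wide range of Besov balls $B_{p,q}^{\alpha}(M)$ defined in (20) and
uniformly over the family of noise distributions given in (22).

Theorem 3. Suppose the wavelet $\psi$ is $r$-regular. Then the estimator $\hat{f}_n$
defined in (18) satisfies, for $p \ge 2$, $\alpha \le r$ and $\frac{2\alpha^2 - \alpha/3}{1 + 2\alpha} > \frac{1}{p}$,

$\sup_{h \in \mathcal{H}} \sup_{f \in B_{p,q}^{\alpha}(M)} E\|\hat{f}_n - f\|_2^2 \le C n^{-2\alpha/(1+2\alpha)},$

and for $1 \le p < 2$, $\alpha \le r$ and $\frac{2\alpha^2 - \alpha/3}{1 + 2\alpha} > \frac{1}{p}$,

$\sup_{h \in \mathcal{H}} \sup_{f \in B_{p,q}^{\alpha}(M)} E\|\hat{f}_n - f\|_2^2 \le C n^{-2\alpha/(1+2\alpha)} (\log n)^{(2-p)/(p(1+2\alpha))}.$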

Theorem 3 shows that the estimator simultaneously attains the optimal 
rate of convergence over a wide range of the Besov classes for $f$ and a
large collection of the unknown error distributions for $\xi_i$. In this sense, the
estimator enjoys a high degree of adaptivity and robustness. 
For functions of spatial inhomogeneity, the local smoothness of the functions
varies significantly from point to point and the global risk given in Theorem
3 cannot wholly reflect the performance of estimators at a point. The local
risk measure

(23) $R(\hat{f}(t_0), f(t_0)) = E(\hat{f}(t_0) - f(t_0))^2$

is used for spatial adaptivity.

The local smoothness of a function can be measured by its local Hölder
smoothness index. For a fixed point $t_0 \in (0, 1)$ and $0 < \alpha \le 1$, define the local
Hölder class $\Lambda^{\alpha}(M, t_0, \delta)$ as follows:

$\Lambda^{\alpha}(M, t_0, \delta) = \{f : |f(t) - f(t_0)| \le M|t - t_0|^{\alpha}, \text{ for } t \in (t_0 - \delta, t_0 + \delta)\}.$

If $\alpha > 1$, then

$\Lambda^{\alpha}(M, t_0, \delta) = \{f : |f^{(\lfloor\alpha\rfloor)}(t) - f^{(\lfloor\alpha\rfloor)}(t_0)| \le M|t - t_0|^{\alpha'} \text{ for } t \in (t_0 - \delta, t_0 + \delta)\},$

where $\lfloor\alpha\rfloor$ is the largest integer less than $\alpha$ and $\alpha' = \alpha - \lfloor\alpha\rfloor$.

In the Gaussian nonparametric regression setting, it is a well-known fact that
for estimation at a point, one must pay a price for adaptation. The optimal
rate of convergence for estimating $f(t_0)$ over the function class $\Lambda^{\alpha}(M, t_0, \delta)$
with $\alpha$ completely known is $n^{-2\alpha/(1+2\alpha)}$. Lepski (1990) and Brown and Low
(1996a, 1996b) showed that one has to pay a price for adaptation of at least
a logarithmic factor. It is shown that the local adaptive minimax rate over
the Hölder class $\Lambda^{\alpha}(M, t_0, \delta)$ is $(\log n/n)^{2\alpha/(1+2\alpha)}$.
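The size of the logarithmic penalty is easy to tabulate. The short Python sketch below, included only as a worked illustration with hypothetical function names, evaluates the two rates for a given smoothness index and sample size; their ratio is $(\log n)^{2\alpha/(1+2\alpha)}$.

```python
import math

def known_smoothness_rate(n, alpha):
    """Rate n^{-2 alpha / (1 + 2 alpha)} when the smoothness index is known."""
    return n ** (-2 * alpha / (1 + 2 * alpha))

def local_adaptive_rate(n, alpha):
    """Local adaptive rate (log n / n)^{2 alpha / (1 + 2 alpha)}."""
    return (math.log(n) / n) ** (2 * alpha / (1 + 2 * alpha))

for n in (10**3, 10**5, 10**7):
    r0 = known_smoothness_rate(n, 1.0)
    r1 = local_adaptive_rate(n, 1.0)
    print(n, r0, r1, r1 / r0)
```

For a smoothness index of 1 the exponent is 2/3, so the penalty factor grows only like $(\log n)^{2/3}$.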

The following theorem shows that our estimator achieves optimal local 
adaptation with the minimal cost uniformly over the family of noise distri- 
butions defined in (22). 

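Theorem 4. Suppose the wavelet $\psi$ is $r$-regular with $r \ge \alpha > 1/6$. Let
$t_0 \in (0, 1)$ be fixed. Then the estimator $\hat{f}_n$ defined in (18) satisfies

(24) $\sup_{h \in \mathcal{H}} \sup_{f \in \Lambda^{\alpha}(M, t_0, \delta)} E(\hat{f}_n(t_0) - f(t_0))^2 \le C \cdot \left(\frac{\log n}{n}\right)^{2\alpha/(1+2\alpha)}.$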

Theorem 4 shows that the estimator automatically attains the local adap- 
tive minimax rate for estimating functions at a point, without prior knowl- 
edge of the smoothness of the underlying functions or prior knowledge of 
the error distribution. 

Remark 4. After binning and taking the medians, in principle any
standard wavelet thresholding estimator could then be used. For example,
the VisuShrink procedure of Donoho and Johnstone (1994) with threshold
$\lambda = \sigma\sqrt{2 \log n}$ can be applied. In this case the resulting estimator satisfies

$\sup_{h \in \mathcal{H}} \sup_{f \in B_{p,q}^{\alpha}(M)} E\|\hat{f}_n - f\|_2^2 \le C \left(\frac{\log n}{n}\right)^{2\alpha/(1+2\alpha)}$

for $1 \le p \le \infty$, $\alpha \le r$ and $\frac{2\alpha^2 - \alpha/3}{1 + 2\alpha} > \frac{1}{p}$ and

(25) $\sup_{h \in \mathcal{H}} \sup_{f \in \Lambda^{\alpha}(M, t_0, \delta)} E(\hat{f}_n(t_0) - f(t_0))^2 \le C \cdot \left(\frac{\log n}{n}\right)^{2\alpha/(1+2\alpha)}$

for $r \ge \alpha > 1/6$.
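As a point of comparison with the block rule, term-by-term thresholding with the universal threshold can be sketched as follows. The sketch assumes the soft-thresholding form commonly associated with VisuShrink and a known noise level; the function name is hypothetical.

```python
import math

def visushrink_soft(coeffs, sigma, n):
    """Soft-threshold each coefficient at the universal level sigma * sqrt(2 log n)."""
    lam = sigma * math.sqrt(2.0 * math.log(n))
    return [math.copysign(max(abs(c) - lam, 0.0), c) for c in coeffs]

print(visushrink_soft([5.0, -5.0, 1.0], sigma=1.0, n=100))
```

In the present setting the noise level would be replaced by an estimate of the level appearing in the approximate Gaussian model for the empirical coefficients.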

We have so far focused on the equally spaced design case. When the design 
is not equally spaced, one can either group the sample using equal-length 
subintervals as in Section 3.2 or bin the sample so that each bin contains 
the same number of observations, and then take the median of each bin. 
The first method produces equally spaced medians that are heteroskedastic 
with the variances depending on the number of observations in the bins. In
this case a wavelet procedure for heteroskedastic Gaussian noise can then
be applied to the medians to obtain an estimator of $f$. The second method
produces unequally spaced medians that are homoskedastic since the number
of observations in the bins is the same. A wavelet procedure for unequally
spaced observations with homoskedastic Gaussian noise can then be used to
get an estimator of $f$. For wavelet procedures for heteroskedastic Gaussian
noise or unequally spaced samples, see, for example, Cai and Brown (1998), 
Kovac and Silverman (2000) and Antoniadis and Fan (2001). 

5. Further discussion. Theorem 1 gives a general quantile coupling inequality
between the median of i.i.d. random variables $X_1, \ldots, X_n$ and a
normal random variable. The collection of the distributions of the i.i.d. random
variables includes the Cauchy and Gaussian distributions as special
cases. Note that for both Cauchy and Gaussian distributions, $h'(0) = 0$,
which suggests we may have a tighter quantile coupling bound as in Zhou
(2005). Let us further assume that $h'(0) = 0$, and $h''(0)$ exists. We can derive
a sharper moderate large deviation result for the median and then obtain
a tighter quantile coupling inequality which improves the classical quantile
coupling bounds with a rate $1/\sqrt{n}$ under certain smoothness conditions for
the distribution function. For every $n$, we can show that there is a mapping
$\tilde{X}_{\mathrm{med}}(Z): \mathbb{R} \to \mathbb{R}$ such that the random variable $\tilde{X}_{\mathrm{med}}(Z)$ has the same
distribution as the median $X_{\mathrm{med}}$ of $X_1, \ldots, X_n$ and

$|\sqrt{4n}\,h(0)\tilde{X}_{\mathrm{med}} - Z| \le \frac{C}{n}(1 + |Z|^3)$ when $|Z| \le \varepsilon\sqrt{n}$,

where $C, \varepsilon > 0$ do not depend on $n$. We can even establish an asymptotic
equivalence result in Le Cam's sense. Assume that

$f \in \mathcal{F} = \{f : |f(y) - f(x)| \le M|x - y|^d\}$

with $d > 3/4$. In the current setting, we modify the procedure with $T = n^{2/3}/\log n$.
Then $m = n/T = n^{1/3}\log n$. Recall that $X_j$ is the median of
the observations on each bin $I_j$ with $1 \le j \le T$. Let $\eta_j$ be the median of
corresponding noise, then

$\min_{(j-1)m+1 \le i \le jm} f\left(\frac{i}{n}\right) \le X_j - \eta_j \le \max_{(j-1)m+1 \le i \le jm} f\left(\frac{i}{n}\right).$

We need to give an asymptotic justification that it is fine treating $X_j$ as if
it were a normal random variable with mean $f(j/T)$ and variance $\frac{1}{4mh^2(0)}$.
We can show that observing $\{X_j\}$ is asymptotically equivalent to observing

$X_j^{**} = f\left(\frac{j}{T}\right) + Z_j, \quad Z_j \overset{\text{i.i.d.}}{\sim} N\left(0, \frac{1}{4mh^2(0)}\right), \quad 1 \le j \le T,$

in Le Cam's sense by showing that the total variation distance between the
distributions of $X_j$'s and $X_j^{**}$'s tends to 0, that is,

$|\mathcal{L}(\{X_j\}) - \mathcal{L}(\{X_j^{**}\})|_{TV} \to 0.$

The result shows that asymptotically there is no difference between observing
$X_j$'s and observing $X_j^{**}$'s. That means all optimal statistical procedures
for the Gaussian model can be carried over to nonparametric robust es- 
timation for bounded losses. For instance, the asymptotic equivalence here 
implies that adaptive procedures including SureShrink of Donoho and John- 
stone (1995), the empirical Bayes estimation of Zhang (2005) and SureBlock 
of Cai and Zhou (2006) can be carried over from the Gaussian regression 
to the Cauchy regression or more general regression. The details of our re- 
sults will be reported elsewhere. Readers may find recent developments in 
the asymptotic equivalence theory in Brown and Low (1996a, 1996b), Nuss- 
baum (1996), Grama and Nussbaum (1998) and Golubev, Nussbaum and 
Zhou (2005). 
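
As a rough numerical illustration of the normal approximation that underlies this equivalence, the following Python sketch simulates medians of $m$ i.i.d. standard Cauchy errors and compares their empirical variance with the nominal value $1/(4mh^2(0))$, where $h(0) = 1/\pi$ for the standard Cauchy density. The function name, the bin size and the number of replications are illustrative assumptions and are not part of the procedure.

```python
import math
import random
import statistics

def cauchy_median_variance(m, reps, seed=0):
    """Empirical variance of the median of m i.i.d. standard Cauchy draws."""
    rng = random.Random(seed)
    medians = []
    for _ in range(reps):
        # Inverse-CDF sampling: tan(pi * (U - 1/2)) is standard Cauchy.
        sample = [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(m)]
        medians.append(statistics.median(sample))
    return statistics.pvariance(medians)

m = 101                      # odd bin size (hypothetical choice)
h0 = 1.0 / math.pi           # standard Cauchy density at 0
empirical = cauchy_median_variance(m, reps=4000)
nominal = 1.0 / (4 * m * h0 ** 2)
print(f"empirical variance {empirical:.4f}, nominal 1/(4 m h(0)^2) {nominal:.4f}")
```

For moderate $m$ the two numbers should agree up to a few percent; the simulation says nothing about total variation distance, which is what the equivalence result actually controls.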

6. Proofs. We shall prove the main results in the order of Theorem 3, 
Theorem 4 and then Theorems 1 and 2. In this section C denotes a positive 
constant not depending on n that may vary from place to place and we set 
$d = \min(\alpha - \frac{1}{p}, 1)$. For simplicity we shall assume that $n$ is divisible by $T$ in
the proof. We first collect necessary tools that are needed for the proofs of 
Theorems 3 and 4. 

6.1. Preparatory results. In our procedure, there are two steps: (1) binning
the data and taking the median in each bin; (2) applying a wavelet transform
to the medians and using BlockJS to construct an estimator of $f$. In
this section, we give two results associated with these two steps. Recall that
we denote by $X_j$ the median of each bin $I_j$ in step 1 and treat $X_j$ as if it were
a normal random variable with mean $f(j/T) + b_m$ and variance $1/(4mh^2(0))$.
The coupling inequality and the fact that a Besov ball $B^\alpha_{p,q}(M)$ can be embedded
into a Hölder ball with smoothness $d = \min(\alpha - \frac{1}{p}, 1) > 0$ [cf. Meyer
(1992)] enable us to precisely control the difference between $X_j$ and that
normal variable. Proposition 1 gives the bounds for both the deterministic 
and stochastic errors. In Proposition 2 we obtain two risk bounds for the 
BlockJS procedure used in step 2. These two bounds are used to study global 
and local adaptation in the following sections. 
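
The two steps can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the estimator analyzed in the paper: it uses the Haar wavelet, a block length of order $\log n$, the threshold constant $\lambda_* = 4.50524$ from Cai (1999), a known $h(0)$, and no bias correction $\hat b_m$ (the simulated noise is symmetric). All function names are hypothetical.

```python
import math
import random
import statistics

LAMBDA_STAR = 4.50524  # threshold constant from Cai (1999); assumed here

def bin_medians(y, T):
    """Step 1: split y into T equal consecutive bins and take each bin's median."""
    m = len(y) // T
    return [statistics.median(y[j * m:(j + 1) * m]) for j in range(T)]

def haar_dwt(x):
    """Orthonormal Haar transform of a list whose length is a power of two.
    Returns (approx, details) with detail levels ordered coarse to fine."""
    approx, details = list(x), []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.insert(0, [(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx, details

def haar_idwt(approx, details):
    """Inverse of haar_dwt."""
    cur = list(approx)
    for d in details:
        nxt = []
        for a, b in zip(cur, d):
            nxt += [(a + b) / math.sqrt(2), (a - b) / math.sqrt(2)]
        cur = nxt
    return cur

def block_js(coeffs, L, sigma2, lam=LAMBDA_STAR):
    """James-Stein shrinkage applied block by block (blocks of length L)."""
    out = []
    for s in range(0, len(coeffs), L):
        block = coeffs[s:s + L]
        S2 = sum(c * c for c in block)
        factor = max(0.0, 1.0 - lam * len(block) * sigma2 / S2) if S2 > 0 else 0.0
        out += [factor * c for c in block]
    return out

def median_blockjs_fit(y, T, h0):
    """Step 1 followed by Step 2; returns fitted values at the T bins."""
    n = len(y)
    L = max(1, int(math.log(n)))          # block length of order log n
    sigma2 = 1.0 / (4 * n * h0 ** 2)      # nominal variance of the scaled medians
    X = bin_medians(y, T)
    approx, details = haar_dwt([x / math.sqrt(T) for x in X])
    # Levels with fewer than L coefficients are left unshrunk in this sketch.
    details = [block_js(d, L, sigma2) if len(d) >= L else d for d in details]
    return [math.sqrt(T) * v for v in haar_idwt(approx, details)]

rng = random.Random(1)
n, T = 4096, 512                          # m = n/T = 8 = n^{1/4}
f = lambda t: math.sin(2 * math.pi * t)
y = [f((i + 1) / n) + math.tan(math.pi * (rng.random() - 0.5)) for i in range(n)]
fit = median_blockjs_fit(y, T, h0=1 / math.pi)
truth = [f((j + 1) / T) for j in range(T)]
mse_fit = statistics.fmean((a - b) ** 2 for a, b in zip(fit, truth))
mse_raw = statistics.fmean((a - b) ** 2 for a, b in zip(bin_medians(y, T), truth))
print(f"MSE of raw bin medians {mse_raw:.3f}, after BlockJS {mse_fit:.3f}")
```

The scaling by $T^{-1/2}$ before the transform makes the nominal noise variance of each empirical coefficient equal to $1/(4nh^2(0))$, which is the value used in the shrinkage factor.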

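Proposition 1. Let $X_j$ be given as in our procedure and let $f \in B^\alpha_{p,q}(M)$.
Then $X_j$ can be written as

(26) $\sqrt{m}\,X_j = \sqrt{m}\,f\left(\frac{j}{T}\right) + \sqrt{m}\,b_m + \frac{1}{2}Z_j + \varepsilon_j + \zeta_j$

where:

(i) $Z_j \overset{\text{i.i.d.}}{\sim} N(0, 1/h^2(0))$;

(ii) $\varepsilon_j$ are constants satisfying $|\varepsilon_j| \le C\sqrt{m}\,T^{-d}$ and so $\frac{1}{n}\sum_{j=1}^{T}\varepsilon_j^2 \le CT^{-2d}$;

(iii) $\zeta_j$ are independent and "stochastically small" random variables
satisfying $E\zeta_j = 0$ and, for any $l > 0$,

(27) $E|\zeta_j|^l \le C_l m^{-l/2} + C_l m^{l/2}T^{-dl}$

and for any $a > 0$

(28) $P(|\zeta_j| > a) \le C_l(a^2 m)^{-l/2} + C_l(a^2 T^{2d}/m)^{-l/2}$

where $C_l > 0$ is a constant depending on $l$ only.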

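Proof. Let $\eta_j = \mathrm{median}(\{\xi_i : (j-1)m + 1 \le i \le jm\})$. We define $Z_j =
\frac{1}{h(0)}\Phi^{-1}(G(\eta_j))$ where $G$ is the distribution of $\eta_j$. It follows from Theorem
1 that $\sqrt{4m}\,\eta_j$ is well approximated by $Z_j$ whose distribution is $N(0, 1/h^2(0))$.
Set

$$\varepsilon_j = \sqrt{m}\,EX_j - \sqrt{m}\,f\left(\frac{j}{T}\right) - \sqrt{m}\,b_m = E\left\{\sqrt{m}\,X_j - \sqrt{m}\,f\left(\frac{j}{T}\right) - \sqrt{m}\,\eta_j\right\}.$$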

This is the deterministic component of the approximation error due to bin- 
ning. It is easy to see that 



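(29) $\min_{(j-1)m+1 \le i \le jm}\left[f\left(\frac{i}{n}\right) - f\left(\frac{j}{T}\right)\right] \le X_j - \eta_j - f\left(\frac{j}{T}\right) \le \max_{(j-1)m+1 \le i \le jm}\left[f\left(\frac{i}{n}\right) - f\left(\frac{j}{T}\right)\right].$

Since $f$ is in a Hölder ball with smoothness $d = \min(\alpha - \frac{1}{p}, 1)$, equation
(29) implies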



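(30) $|\varepsilon_j| \le \sqrt{m}\,E\left|\max_{(j-1)m+1 \le i \le jm}\left|f\left(\frac{i}{n}\right) - f\left(\frac{j}{T}\right)\right|\right| \le C\sqrt{m}\,T^{-d}.$

Set

$$\zeta_j = \sqrt{m}\,X_j - \sqrt{m}\,f\left(\frac{j}{T}\right) - \sqrt{m}\,b_m - \varepsilon_j - \frac{1}{2}Z_j.$$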

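Then $E\zeta_j = 0$ and $\sqrt{m}\,X_j = \sqrt{m}\,f(j/T) + \sqrt{m}\,b_m + \varepsilon_j + \frac{1}{2}Z_j + \zeta_j$. The random error
$\zeta_j$ is the sum of two terms, $\zeta_{1j} = \sqrt{m}\,X_j - \sqrt{m}\,f(j/T) - \sqrt{m}\,\eta_j - \varepsilon_j$ and
$\zeta_{2j} = \sqrt{m}\,\eta_j - \sqrt{m}\,b_m - \frac{1}{2}Z_j$, where $\zeta_{1j}$ is the random component of the approximation
error due to binning, and $\zeta_{2j}$ is the error of approximating the median by
the Gaussian variable. From equation (29) we have $|\zeta_{1j}| \le C\sqrt{m}\,T^{-d}$ and so

(31) $E|\zeta_{1j}|^l \le C_l m^{l/2}T^{-dl}.$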

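A bound for the approximation error $\zeta_{2j}$ is given in Corollary 1,

(32) $|\zeta_{2j}| \le \frac{C}{\sqrt{m}}(1 + |Z_j|^2)$ when $|Z_j| \le \varepsilon\sqrt{m}$

for some $\varepsilon > 0$, and the probability of $|Z_j| > \varepsilon\sqrt{m}$ is exponentially small.
Hence for any finite integer $l \ge 1$ (here $l$ is fixed and $m = n^\gamma \to \infty$),

$$E|\zeta_{2j}|^l = E|\zeta_{2j}|^l I\{|Z_j| \le \varepsilon\sqrt{m}\} + E|\zeta_{2j}|^l I\{|Z_j| > \varepsilon\sqrt{m}\}$$
$$\le C_l m^{-l/2} + (E|\zeta_{2j}|^{2l})^{1/2}[P\{|Z_j| > \varepsilon\sqrt{m}\}]^{1/2}$$

for some constant $C_l > 0$, where

$$P\{|Z| > \varepsilon\sqrt{m}\} \le \frac{1}{2}\exp\left(-\frac{\varepsilon^2}{2}m\right)$$

by Mill's ratio inequality

(33) $\frac{\varphi(x)}{1 - \Phi(x)} > \max\left\{x, \frac{2}{\sqrt{2\pi}}\right\}$ for $x > 0$

and

(34) $E|\sqrt{m}\,\eta_j|^{2l} \le m^l E|\eta_j|^{2l} \le D_l m^l$

for some constant $D_l > 0$ because of Assumption (A2), so we have

(35) $E|\zeta_{2j}|^l \le C_l m^{-l/2}.$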
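
The Mill's ratio inequality used above can be checked numerically. The sketch below evaluates the standard normal hazard $\varphi(x)/(1 - \Phi(x))$ on a grid and compares it with $\max\{x, \sqrt{2/\pi}\}$; the second argument is the value of the hazard at $x = 0$ and is our assumption about the constant intended in the inequality. The grid and the helper name are illustrative.

```python
import math
from statistics import NormalDist

std = NormalDist()

def hazard(x):
    """phi(x) / (1 - Phi(x)) for the standard normal distribution."""
    return std.pdf(x) / (1.0 - std.cdf(x))

grid = [0.01 * i for i in range(1, 601)]                  # x in (0, 6]
lower = [max(x, 2.0 / math.sqrt(2.0 * math.pi)) for x in grid]
ok = all(hazard(x) > lo for x, lo in zip(grid, lower))
print("Mill's ratio lower bound holds on the grid:", ok)
```

The grid stops at $x = 6$ because computing $1 - \Phi(x)$ by subtraction loses accuracy far in the tail.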
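Details for equation (34) are as follows. Assumption (A2) implies

$$P(|\xi_i| > |x|) \le \frac{C}{|x|^{\varepsilon_3}}.$$

For $m = 2v + 1$ i.i.d. $\xi_i$, from equation (65) the density of the sample median
is

$$g(x) = \frac{\sqrt{8v}}{\sqrt{2\pi}}[4H(x)(1 - H(x))]^v h(x)\exp\left(O\left(\frac{1}{v}\right)\right)$$
$$\le \frac{\sqrt{8v}}{\sqrt{2\pi}}\left[\frac{4C}{|x|^{\varepsilon_3}}\right]^v h(x)\exp\left(O\left(\frac{1}{v}\right)\right)$$
$$= \frac{\sqrt{8v}}{\sqrt{2\pi}}\left[\frac{4C}{|x|^{\varepsilon_3/2}}\right]^v \frac{1}{|x|^{v\varepsilon_3/2}}\,h(x)\exp\left(O\left(\frac{1}{v}\right)\right).$$

When $|x|^{\varepsilon_3/2} \ge 8C$, we have

$$\frac{\sqrt{8v}}{\sqrt{2\pi}}\left[\frac{4C}{|x|^{\varepsilon_3/2}}\right]^v \le \frac{\sqrt{8v}}{\sqrt{2\pi}\,2^v},$$

which is bounded for all $v$. This implies that as $v \to \infty$ ($m \asymp n^\gamma$ in our procedure)
the median has finite moments of any order.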
Thus we have

$$E|\zeta_j|^l \le 2^{l-1}(E|\zeta_{1j}|^l + E|\zeta_{2j}|^l) \le C_l m^{-l/2} + C_l m^{l/2}T^{-dl}$$

from equations (31) and (35). Equation (28) then follows from Chebyshev's
inequality. □
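
The quantile coupling $Z_j = \frac{1}{h(0)}\Phi^{-1}(G(\eta_j))$ used in the proof can be evaluated exactly for a concrete error law. The sketch below is an illustration under assumptions of our own choosing (standard Cauchy errors, $m = 2v + 1 = 101$, a grid of median values): it computes $G$ through the binomial representation of the median's distribution function and reports the largest gap between $\sqrt{4m}\,h(0)\eta$ and $\Phi^{-1}(G(\eta))$ on the grid. The helper names are hypothetical.

```python
import math
from statistics import NormalDist

def median_cdf(x, v, H):
    """G(x) = P(median of m = 2v+1 i.i.d. draws <= x) = P(Bin(m, H(x)) >= v + 1)."""
    m, p = 2 * v + 1, H(x)
    return sum(math.comb(m, i) * p ** i * (1 - p) ** (m - i)
               for i in range(v + 1, m + 1))

def coupled_normal(x, v, H):
    """Standard normal value coupled to a median value x: Phi^{-1}(G(x))."""
    return NormalDist().inv_cdf(median_cdf(x, v, H))

cauchy_cdf = lambda x: 0.5 + math.atan(x) / math.pi
h0 = 1.0 / math.pi           # standard Cauchy density at 0
v = 50
m = 2 * v + 1
gaps = []
for i in range(-30, 31):
    eta = i / 100.0          # candidate median values in [-0.3, 0.3]
    z = coupled_normal(eta, v, cauchy_cdf)
    gaps.append(abs(math.sqrt(4 * m) * h0 * eta - z))
max_gap = max(gaps)
print(f"largest coupling gap on the grid: {max_gap:.4f}")
```

On this grid the coupled normal value ranges over roughly $[-1.9, 1.9]$, so the gap is small relative to the scale of $Z$, as the coupling inequality suggests.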



Remark 5. In the proof of Proposition 2, we will see that the noise
$\zeta_j$ has a negligible contribution to the risk of our procedure compared with
the Gaussian noise $\frac{1}{2}Z_j$, when the tail bound $P(|\zeta_j| > a)$ decays faster than
any polynomial of $n$. For $m = n^\gamma$ we have $T^{2d}/m = n^{2d - \gamma(2d+1)}$. Then from
equation (28) it is enough to require $0 < \gamma < \frac{2d}{2d+1}$, that is,

(36) $d = \min(\alpha - \frac{1}{p}, 1) > \frac{\gamma}{2(1 - \gamma)}$

which is satisfied under our assumption (see also Remark 7).

Remark 6. In the proofs of Theorems 3 and 4, we shall assume without
loss of generality that $h(0)$ is known and equal to 1 since it can be estimated
accurately in the sense that there is an estimator $\hat h(0)$ such that

(37) $P\{|\hat h^{-2}(0) - h^{-2}(0)| > n^{-\delta}\} \le c_l n^{-l}$

for some $\delta > 0$ and all $l \ge 1$. For instance, we may estimate $h^{-2}(0)$ by

(38) $\hat h^{-2}(0) = \frac{8m}{T}\sum_k (X_{2k-1} - X_{2k})^2.$

Note that $E\,\frac{8m}{T}\sum_k (X_{2k-1} - X_{2k})^2 = h^{-2}(0) + O(\sqrt{m}\,T^{-d})$, and it is easy to
show

$$E\left|\frac{8m}{T}\sum_k (X_{2k-1} - X_{2k})^2 - h^{-2}(0)\right|^l \le C_l(\sqrt{m}\,T^{-d})^l$$

where $\sqrt{m}\,T^{-d} = n^{-\delta}$ with $\delta > 0$ in our assumption. Then equation (37) holds
by Chebyshev's inequality. It is very important to see that the asymptotic
risk properties of our estimator (16) do not change when replacing $\lambda_*$ by
$\lambda_*(1 + O(n^{-\delta}))$, thus in the rest of our analysis we may just assume that
$h(0) = 1$ without loss of generality.

We now consider the wavelet transform of the medians of the binned data.
From Proposition 1 we may write

$$\frac{1}{\sqrt{T}}X_i = \frac{g(i/T)}{\sqrt{T}} + \frac{\varepsilon_i}{\sqrt{n}} + \frac{Z_i}{2\sqrt{n}} + \frac{\zeta_i}{\sqrt{n}}.$$

Let $(y_{j,k}) = T^{-1/2}W \cdot X$ be the discrete wavelet transform of the binned
data. Then one may write

(39) $y_{j,k} = \theta'_{j,k} + \varepsilon_{j,k} + \frac{1}{2\sqrt{n}}z_{j,k} + \xi_{j,k}$

where $\theta'_{j,k}$ are the discrete wavelet transform of $(g(\frac{i}{T}))_{1 \le i \le T}$ which are
approximately equal to the true wavelet coefficients of $g$, $z_{j,k}$ are the transform
of the $Z_i$'s and so are i.i.d. $N(0, 1)$ and $\varepsilon_{j,k}$ and $\xi_{j,k}$ are respectively
the transforms of $(\frac{\varepsilon_i}{\sqrt{n}})$ and $(\frac{\zeta_i}{\sqrt{n}})$. The following proposition studies the risk
of the BlockJS procedure in Step 2. For each single block the risk bounds here
for BlockJS are similar to results in Cai (1999) where Gaussian noise was
considered. But in the current setting the error terms $\varepsilon_{j,k}$ and $\xi_{j,k}$ make the
problem more complicated.

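Proposition 2. Let the empirical wavelet coefficients $y_{j,k} = \theta'_{j,k} + \varepsilon_{j,k} +
\frac{1}{2\sqrt{n}}z_{j,k} + \xi_{j,k}$ be given as in (39) and let the block thresholding estimator $\hat\theta_{j,k}$
be defined as in (16). Then:

(i) for some constant $C > 0$

(40) $E\sum_{(j,k) \in B_j^i}(\hat\theta_{j,k} - \theta'_{j,k})^2 \le \min\left\{4\sum_{(j,k) \in B_j^i}(\theta'_{j,k})^2,\ 8\lambda_* L n^{-1}\right\} + 6\sum_{(j,k) \in B_j^i}\varepsilon_{j,k}^2 + CLn^{-2};$

(ii) for any $0 < \tau < 1$, there exists a constant $C_\tau > 0$ depending on $\tau$ only
such that for all $(j, k) \in B_j^i$

(41) $E(\hat\theta_{j,k} - \theta'_{j,k})^2 \le C_\tau \cdot \min\left\{\max_{(j,k) \in B_j^i}(\theta'_{j,k} + \varepsilon_{j,k})^2,\ Ln^{-1}\right\} + n^{-2+\tau}.$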



We need the following lemmas to prove Proposition 2. These three lemmas 
are from Brown et al. (2006). See also Cai (1999). 

Lemma 1. Let $X_1, \ldots, X_n$ be independent random variables with $E(X_i) = 0$
for $i = 1, \ldots, n$. Suppose that $E|X_i|^k < M_k$ for all $i$ and all $k > 0$ with
$M_k > 0$ some constant not depending on $n$. Let $Y = WX$ be an orthogonal
transform of $X = (X_1, \ldots, X_n)'$. Then there exist constants $M'_k$ not depending
on $n$ such that $E|Y_i|^k < M'_k$ for all $i = 1, \ldots, n$ and all $k > 0$.



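Lemma 2. Suppose $y_i = \theta_i + z_i$, $i = 1, \ldots, L$, where $\theta_i$ are constants and
$z_i$ are random variables. Let $S^2 = \sum_{i=1}^{L} y_i^2$ and let $\hat\theta_i = (1 - \frac{\lambda L}{S^2})_+ y_i$. Then

(42) $E\|\hat\theta - \theta\|_2^2 \le \|\theta\|_2^2 \wedge 4\lambda L + 4E[\|z\|_2^2 I(\|z\|_2^2 > \lambda L)].$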



Lemma 3. Let $X \sim \chi^2_L$ and $\lambda > 1$. Then

(43) $P(X \ge \lambda L) \le e^{-(L/2)(\lambda - \log\lambda - 1)}$ and $EXI(X \ge \lambda L) \le \lambda L e^{-(L/2)(\lambda - \log\lambda - 1)}.$
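
Both bounds in Lemma 3 can be verified exactly for even degrees of freedom, where the chi-square tail has a closed form. The following sketch is a numerical sanity check on a small grid of $(L, \lambda)$ values chosen for illustration; it relies on the identity $E[XI(X > x)] = L \cdot P(\chi^2_{L+2} > x)$ for $X \sim \chi^2_L$. The function names are hypothetical.

```python
import math

def chi2_sf_even(x, df):
    """P(chi^2_df > x) for even df, via the Poisson-sum identity."""
    r = df // 2
    return math.exp(-x / 2) * sum((x / 2) ** i / math.factorial(i) for i in range(r))

def lemma3_bound(lam, L):
    """exp(-(L/2)(lam - log(lam) - 1)), the common factor in both bounds."""
    return math.exp(-(L / 2) * (lam - math.log(lam) - 1))

checks = []
for L in (2, 4, 10, 20):
    for lam in (1.1, 2.0, 4.50524, 8.0):
        tail = chi2_sf_even(lam * L, L)
        # E[X 1{X > x}] = L * P(chi^2_{L+2} > x) for X ~ chi^2_L.
        trunc_mean = L * chi2_sf_even(lam * L, L + 2)
        bound = lemma3_bound(lam, L)
        checks.append(tail <= bound and trunc_mean <= lam * L * bound)
print("Lemma 3 bounds hold on the grid:", all(checks))
```

With $\lambda = 4.50524$ the exponent equals $-L$ up to rounding, which is why a block length $L$ of order $\log n$ produces tail terms of order $n^{-1}$ in the proof that follows.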

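Proof of Proposition 2. We only give the proof for (i). From Proposition
1, we have $|\varepsilon_j| \le C\sqrt{m}\,T^{-d}$ and $\varepsilon_{j,k} = \sum_i \frac{\varepsilon_i}{\sqrt{n}}\int \phi_{J,i}\psi_{j,k}$. Hence

(44) $|\varepsilon_{j,k}| \le \sup_x\left|\sum_i \frac{\varepsilon_i}{\sqrt{n}}\phi_{J,i}(x)\right| \cdot \int|\psi_{j,k}(x)|\,dx \le CT^{-d}2^{-j/2}.$

This, as well as Proposition 1, yields that

(45) $\sum_j\sum_k \varepsilon_{j,k}^2 \le \frac{1}{n}\sum_i \varepsilon_i^2 \le CT^{-2d}.$

It is easy to see from Lemma 1 and Proposition 1 that

(46) $E|\xi_{j,k}|^l \le C'_l(mn)^{-l/2} + C'_l(T^{2d}n/m)^{-l/2}$

and for any $a > 0$

(47) $P(|\xi_{j,k}| > a) \le C'_l(a^2 mn)^{-l/2} + C'_l(a^2 T^{2d}n/m)^{-l/2}.$

It follows from Lemma 2 that

$$E\sum_{(j,k) \in B_j^i}(\hat\theta_{j,k} - \theta'_{j,k})^2$$
$$\le 2\min\left\{\sum_{(j,k) \in B_j^i}(\theta'_{j,k} + \varepsilon_{j,k})^2,\ 4\lambda_* L n^{-1}\right\} + 2\sum_{(j,k) \in B_j^i}\varepsilon_{j,k}^2$$
$$\quad + 8E\sum_{(j,k) \in B_j^i}\left(\frac{1}{2\sqrt{n}}z_{j,k} + \xi_{j,k}\right)^2 I\left(\sum_{(j,k) \in B_j^i}\left(\frac{1}{2\sqrt{n}}z_{j,k} + \xi_{j,k}\right)^2 > \frac{\lambda_* L}{4n}\right)$$
$$\le \min\left\{4\sum_{(j,k) \in B_j^i}(\theta'_{j,k})^2,\ 8\lambda_* L n^{-1}\right\} + 6\sum_{(j,k) \in B_j^i}\varepsilon_{j,k}^2$$
$$\quad + 2n^{-1}E\sum_{(j,k) \in B_j^i}(z_{j,k} + 2\sqrt{n}\,\xi_{j,k})^2 I\left(\sum_{(j,k) \in B_j^i}(z_{j,k} + 2\sqrt{n}\,\xi_{j,k})^2 > \lambda_* L\right).$$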



Denote by $A$ the event that all $|\xi_{j,k}|$ are bounded by $\frac{1}{2\sqrt{n}L}$, that is,

$$A = \{|2\sqrt{n}\,\xi_{j,k}| \le L^{-1} \text{ for all } (j,k) \in B_j^i\}.$$

Then it follows from (47) that for any $l \ge 1$

(48) $P(A^c) \le \sum_{(j,k) \in B_j^i} P(|2\sqrt{n}\,\xi_{j,k}| > L^{-1}) \le C'_l(L^{-2}m)^{-l/2} + C'_l(L^{-2}T^{2d}/m)^{-l/2}.$

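Hence

$$D = E\sum_{(j,k) \in B_j^i}(z_{j,k} + 2\sqrt{n}\,\xi_{j,k})^2 I\left(\sum_{(j,k) \in B_j^i}(z_{j,k} + 2\sqrt{n}\,\xi_{j,k})^2 > \lambda_* L\right)$$
$$= E\sum_{(j,k) \in B_j^i}(z_{j,k} + 2\sqrt{n}\,\xi_{j,k})^2 I\left(A \cap \sum_{(j,k) \in B_j^i}(z_{j,k} + 2\sqrt{n}\,\xi_{j,k})^2 > \lambda_* L\right)$$
$$\quad + E\sum_{(j,k) \in B_j^i}(z_{j,k} + 2\sqrt{n}\,\xi_{j,k})^2 I\left(A^c \cap \sum_{(j,k) \in B_j^i}(z_{j,k} + 2\sqrt{n}\,\xi_{j,k})^2 > \lambda_* L\right)$$
$$= D_1 + D_2.$$

Note that for any $L > 1$, $(x + y)^2 \le \frac{L}{L-1}x^2 + Ly^2$ for all $x$ and $y$. It then
follows from Lemma 3 and Hölder's inequality that

$$D_1 = E\sum_{(j,k) \in B_j^i}(z_{j,k} + 2\sqrt{n}\,\xi_{j,k})^2 I\left(A \cap \sum_{(j,k) \in B_j^i}(z_{j,k} + 2\sqrt{n}\,\xi_{j,k})^2 > \lambda_* L\right)$$
$$\le 2E\sum_{(j,k) \in B_j^i}z_{j,k}^2 I\left(\sum_{(j,k) \in B_j^i}z_{j,k}^2 > \lambda_* L - \lambda_* - 1\right) + 8nE\sum_{(j,k) \in B_j^i}\xi_{j,k}^2 I\left(\sum_{(j,k) \in B_j^i}z_{j,k}^2 > \lambda_* L - \lambda_* - 1\right)$$
$$\le 2(\lambda_* L - \lambda_* - 1)e^{-(L/2)(\lambda_* - (\lambda_*+1)L^{-1} - \log(\lambda_* - (\lambda_*+1)L^{-1}) - 1)}$$
$$\quad + 8n\sum_{(j,k) \in B_j^i}(E|\xi_{j,k}|^{2p})^{1/p}\left(P\left(\sum_{(j,k) \in B_j^i}z_{j,k}^2 > \lambda_* L - \lambda_* - 1\right)\right)^{1/q}$$

where $p, q > 1$ and $\frac{1}{p} + \frac{1}{q} = 1$. For $m = n^\varepsilon$ we take $\frac{1}{q} = 1 - \varepsilon$. Then it follows
from Lemma 3 and (46) that

$$D_1 \le \lambda_* e^{(\lambda_*+1)/2}Ln^{-1} + CLm^{-1}n^{-(1-\varepsilon)} = CLn^{-1}.$$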

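On the other hand, it follows from (46) and (48) (by taking $l$ sufficiently
large) that

$$D_2 = E\sum_{(j,k) \in B_j^i}(z_{j,k} + 2\sqrt{n}\,\xi_{j,k})^2 I\left(A^c \cap \sum_{(j,k) \in B_j^i}(z_{j,k} + 2\sqrt{n}\,\xi_{j,k})^2 > \lambda_* L\right)$$
$$\le E\sum_{(j,k) \in B_j^i}(2z_{j,k}^2 + 8n\xi_{j,k}^2)I(A^c)$$
$$\le \sum_{(j,k) \in B_j^i}[2(Ez_{j,k}^4)^{1/2} + 8n(E\xi_{j,k}^4)^{1/2}] \cdot (P(A^c))^{1/2}$$
$$\le n^{-1}.$$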

Hence, $D = D_1 + D_2 \le CLn^{-1}$ and consequently

$$E\sum_{(j,k) \in B_j^i}(\hat\theta_{j,k} - \theta'_{j,k})^2 \le \min\left\{4\sum_{(j,k) \in B_j^i}(\theta'_{j,k})^2,\ 8\lambda_* L n^{-1}\right\} + 6\sum_{(j,k) \in B_j^i}\varepsilon_{j,k}^2 + CLn^{-2}$$

for some constant $C > 0$. □



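Recall that the $\theta'_{j,k}$'s are the discrete wavelet transform of $(f(\frac{i}{T}))_{1 \le i \le T}$ and
the $\theta_{j,k}$'s are the true wavelet coefficients of $f$. The following lemma will be used to
bound the difference of the $\theta'_{j,k}$'s and $\theta_{j,k}$'s. The proof is straightforward and is
thus omitted.

Lemma 4. Let $T = 2^J$ and let $f_J(x) = \sum_{k=1}^{T}\frac{1}{\sqrt{T}}f(\frac{k}{T})\phi_{J,k}(x)$. Then

$$\sup_{f \in B^\alpha_{p,q}(M)}\|f_J - f\|_2^2 \le CT^{-2d} \quad \text{where } d = \min(\alpha - 1/p, 1).$$

Also, $|\theta'_{j,k} - \theta_{j,k}| \le CT^{-d}2^{-j/2}$ and consequently $\sum_{j=j_0}^{J-1}\sum_k(\theta'_{j,k} - \theta_{j,k})^2 \le
CT^{-2d}$.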

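Lemma 5. Let $b_m$ and $\hat b_m$ be defined as in (9) and (17), respectively.
Then

(49) $\sup_{h \in \mathcal{H}}\left|b_m + \frac{h'(0)}{8h^3(0)m}\right| \le Cm^{-2},$

(50) $\sup_{h \in \mathcal{H}}\ \sup_{f \in B^\alpha_{p,q}(M)} E(\hat b_m - b_m)^2 \le C \cdot \max\{T^{-2d}, m^{-4}\}.$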

Proof. It suffices to consider the case that $m = 2v + 1$ with $v \in \mathbb{N}$ (cf.
Remark 1), then

$$E\xi_{\mathrm{med}} = \int x\,\frac{(2v+1)!}{(v!)^2}H^v(x)(1 - H(x))^v\,dH(x),$$

where $H$ is the distribution function of $\xi_1$. For any $\delta > 0$, set $A_\delta = \{x : |H(x) -
\frac{1}{2}| \le \delta\}$. It follows from the definition of $\mathcal{H}$ that there exists a constant $\delta > 0$
such that for some $\varepsilon > 0$ we have

(51) $|h^{(3)}(x)| \le 1/\varepsilon$ and $\varepsilon \le h(x) \le 1/\varepsilon$

uniformly over all $h \in \mathcal{H}$ for all $x \in A_\delta$. This property implies $H^{-1}(x)$ is well
defined and differentiable up to the fourth order for $x \in A_\delta$. Decompose the
expectation of the median into two parts:

$$E\xi_{\mathrm{med}} = \left(\int_{A_\delta} + \int_{A_\delta^c}\right)x\,\frac{(2v+1)!}{(v!)^2}H^v(x)(1 - H(x))^v\,dH(x) = Q_1 + Q_2.$$

Since the median has finite moments from equation (34), it is easy to see $Q_2$
decays to 0 exponentially fast as $v = O(n^{1/4}) \to \infty$ by the Cauchy-Schwarz
inequality and tail probability equations (63) and (64). We now turn to $Q_1$.
Note that

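$$Q_1 = \int_{1/2-\delta}^{1/2+\delta}\left(H^{-1}(x) - H^{-1}\left(\tfrac{1}{2}\right)\right)\frac{(2v+1)!}{(v!)^2}x^v(1-x)^v\,dx$$
$$= \int_{1/2-\delta}^{1/2+\delta}\left[\frac{1}{2}(H^{-1})''\left(\tfrac{1}{2}\right)\left(x - \tfrac{1}{2}\right)^2 + \frac{1}{24}(H^{-1})^{(4)}(\zeta)\left(x - \tfrac{1}{2}\right)^4\right]\frac{(2v+1)!}{(v!)^2}x^v(1-x)^v\,dx$$

since $x^v(1-x)^v$ is symmetric around $x = \frac{1}{2}$. Note that $\frac{(2v+1)!}{(v!)^2}x^v(1-x)^v$ is
the density function of Beta$(v+1, v+1)$, and equation (51) implies that
$(H^{-1})^{(4)}(\zeta)$ is uniformly bounded over all $h \in \mathcal{H}$, then

$$Q_1 = \frac{1}{2}(H^{-1})''\left(\tfrac{1}{2}\right)\frac{(v+1)^2}{(2v+2)^2(2v+3)} + O(m^{-2}) = -\frac{h'(0)}{8h^3(0)m} + O(m^{-2})$$

and (49) is established.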

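Note that for $m = 2v + 1$, $[\frac{m}{2}] = v$. From Proposition 1 we have

$$X_j = f\left(\frac{j}{T}\right) + b_m + \frac{1}{\sqrt{m}}\varepsilon_j + \frac{1}{2\sqrt{m}}Z_j + \frac{1}{\sqrt{m}}\zeta_j.$$

Similarly we may write

$$X_j^* = f\left(\frac{j - 1/2}{T}\right) + b_v + \frac{1}{\sqrt{v}}\varepsilon_j^* + \frac{1}{2\sqrt{v}}Z_j^* + \frac{1}{\sqrt{v}}\zeta_j^*$$

with $Z_j^*$, $\varepsilon_j^*$ and $\zeta_j^*$ satisfying properties (i), (ii), (iii) of Proposition 1, respectively.
Then $\hat b_m - b_m = \frac{1}{T}\sum_j (X_j^* - X_j) - b_m$ can be written as a sum of
five terms as follows:

$$\hat b_m - b_m = \frac{1}{T}\sum_j\left[f\left(\frac{j - 1/2}{T}\right) - f\left(\frac{j}{T}\right)\right] + (b_v - 2b_m) + \frac{1}{T}\sum_j\left(\frac{\varepsilon_j^*}{\sqrt{v}} - \frac{\varepsilon_j}{\sqrt{m}}\right)$$
$$\quad + \frac{1}{T}\sum_j\left(\frac{Z_j^*}{2\sqrt{v}} - \frac{Z_j}{2\sqrt{m}}\right) + \frac{1}{T}\sum_j\left(\frac{\zeta_j^*}{\sqrt{v}} - \frac{\zeta_j}{\sqrt{m}}\right)$$
$$= R_1 + R_2 + R_3 + R_4 + R_5.$$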
It is easy to see that $\sup_{f \in B^\alpha_{p,q}(M)} R_1^2 \le CT^{-2d}$ and $\sup_{h \in \mathcal{H}} R_2^2 \le Cm^{-4}$.
Proposition 1 yields $\sup_{h \in \mathcal{H}, f \in B^\alpha_{p,q}(M)} R_3^2 \le CT^{-2d}$. Note that $Z_j^* - Z_j$ are
independent for $j = 1, \ldots, T$. So $ER_4^2 \le Cn^{-1}$. Similarly,
$\zeta_j^* - \zeta_j$ are independent and it then follows from Proposition 1 that $ER_5^2 =
o(n^{-1})$. Hence,

$$\sup_{h \in \mathcal{H}, f \in B^\alpha_{p,q}(M)} E(\hat b_m - b_m)^2 \le 5R_1^2 + 5R_2^2 + 5R_3^2 + 5ER_4^2 + 5ER_5^2 \le C\max\{T^{-2d}, m^{-4}\}. \quad □$$

6.2. Global adaptation: Proof of Theorem 3. Let $\hat f_n$ be given as in (18).
Note that

$$E\|\hat f_n - f\|_2^2 \le 2E\|\hat g_n - g\|_2^2 + 2E(\hat b_m - b_m)^2.$$

Lemma 5 yields that $E(\hat b_m - b_m)^2 = o(n^{-2\alpha/(2\alpha+1)})$ and so we need only to
focus on bounding $E\|\hat g_n - g\|_2^2$. Note that the functions $f$ and $g$ differ only
by a constant $b_m$ and so the wavelet coefficients coincide, that is, $\theta_{j,k} =
\int f\psi_{j,k} = \int g\psi_{j,k}$. Decompose $E\|\hat g_n - g\|_2^2$ into three terms as follows:

(52) $E\|\hat g_n - g\|_2^2 = \sum_k E(\hat{\tilde\theta}_{j_0,k} - \tilde\theta_{j_0,k})^2 + \sum_{j=j_0}^{J-1}\sum_k E(\hat\theta_{j,k} - \theta_{j,k})^2 + \sum_{j=J}^{\infty}\sum_k \theta_{j,k}^2 = S_1 + S_2 + S_3.$

It is easy to see that the first term $S_1$ and the third term $S_3$ are small:

(53) $S_1 = 2^{j_0}n^{-1}\epsilon^2 = o(n^{-2\alpha/(1+2\alpha)}).$

Note that for $x \in \mathbb{R}^m$ and $0 < p_1 \le p_2 \le \infty$

(54) $\|x\|_{p_2} \le \|x\|_{p_1} \le m^{1/p_1 - 1/p_2}\|x\|_{p_2}.$
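
Inequality (54) is the usual comparison of $\ell_p$ norms on $\mathbb{R}^m$. A short numerical check, with a dimension, exponents and a random vector chosen purely for illustration, is:

```python
import random

def p_norm(x, p):
    """The l_p norm (sum |x_i|^p)^(1/p) of a finite vector."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

rng = random.Random(0)
m, p1, p2 = 64, 1.0, 2.0
x = [rng.gauss(0, 1) for _ in range(m)]
lower = p_norm(x, p2)
middle = p_norm(x, p1)
upper = m ** (1 / p1 - 1 / p2) * p_norm(x, p2)
print(lower <= middle <= upper)
```

The left inequality is tight when $x$ has a single nonzero entry and the right one when all entries have equal magnitude, which is why both directions are needed when passing between the cases $p \ge 2$ and $p < 2$.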

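Since $f \in B^\alpha_{p,q}(M)$, we have $2^{js}(\sum_{k=1}^{2^j}|\theta_{j,k}|^p)^{1/p} \le M$ with $s = \alpha + 1/2 - 1/p$. Now (54) yields that

(55) $S_3 = \sum_{j=J}^{\infty}\sum_k \theta_{j,k}^2 \le C2^{-2J(\alpha \wedge (\alpha + 1/2 - 1/p))}.$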
j=j k 

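Proposition 2, Lemma 4 and equation (45) yield that

$$S_2 \le 2\sum_{j=j_0}^{J-1}\sum_k E(\hat\theta_{j,k} - \theta'_{j,k})^2 + 2\sum_{j=j_0}^{J-1}\sum_k(\theta'_{j,k} - \theta_{j,k})^2$$

(56) $\le \sum_{j=j_0}^{J-1}\sum_{i=1}^{2^j/L}\min\left\{8\sum_{(j,k) \in B_j^i}\theta_{j,k}^2,\ 8\lambda_* Ln^{-1}\right\} + 6\sum_{j=j_0}^{J-1}\sum_k\varepsilon_{j,k}^2 + Cn^{-1} + 10\sum_{j=j_0}^{J-1}\sum_k(\theta'_{j,k} - \theta_{j,k})^2$

$$\le \sum_{j=j_0}^{J-1}\sum_{i=1}^{2^j/L}\min\left\{8\sum_{(j,k) \in B_j^i}\theta_{j,k}^2,\ 8\lambda_* Ln^{-1}\right\} + Cn^{-1} + CT^{-2d}.$$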
We now divide into two cases. First consider the case $p \ge 2$. Let $J_1 = [\frac{1}{1+2\alpha}
\log_2 n]$. So, $2^{J_1} \approx n^{1/(1+2\alpha)}$. Then (56) and (54) yield

$$S_2 \le 8\lambda_*\sum_{j=j_0}^{J_1-1}\sum_{i=1}^{2^j/L}Ln^{-1} + 8\sum_{j=J_1}^{J-1}\sum_k\theta_{j,k}^2 + Cn^{-1} + CT^{-2d} \le Cn^{-2\alpha/(1+2\alpha)}.$$

By combining this with (53) and (55), we have $E\|\hat f_n - f\|_2^2 \le Cn^{-2\alpha/(1+2\alpha)}$
for $p \ge 2$.

Now let us consider the case p <2. First we state the following lemma 
without proof. 

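Lemma 6. Let $0 < p < 1$ and $S = \{x \in \mathbb{R}^k : \sum_{i=1}^{k}x_i^p \le B,\ x_i \ge 0,\ i = 1, \ldots, k\}$.
Then for $A > 0$, $\sup_{x \in S}\sum_{i=1}^{k}(x_i \wedge A) \le B \cdot A^{1-p}$.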
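
Lemma 6 follows from the pointwise bound $x_i \wedge A \le x_i^p A^{1-p}$. The sketch below checks the resulting inequality on randomly generated vectors rescaled so that $\sum_i x_i^p = B$; the values of $p$, $B$, $A$ and $k$ are arbitrary illustrative choices.

```python
import random

def truncated_sum(x, A):
    """Sum of min(x_i, A) over the vector x."""
    return sum(min(v, A) for v in x)

rng = random.Random(0)
p, B, A, k = 0.5, 3.0, 0.2, 50
holds = True
for _ in range(1000):
    raw = [rng.random() ** 4 for _ in range(k)]
    scale = (B / sum(v ** p for v in raw)) ** (1 / p)   # rescale so sum x_i^p = B
    x = [scale * v for v in raw]
    holds = holds and truncated_sum(x, A) <= B * A ** (1 - p) + 1e-12
print("Lemma 6 bound holds on all sampled vectors:", holds)
```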

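Let $J_2$ be an integer satisfying $2^{J_2} \asymp n^{1/(1+2\alpha)}(\log n)^{(2-p)/(p(1+2\alpha))}$. Note
that

$$\sum_{i=1}^{2^j/L}\Big(\sum_{(j,k) \in B_j^i}\theta_{j,k}^2\Big)^{p/2} \le \sum_{k=1}^{2^j}(\theta_{j,k}^2)^{p/2} \le M2^{-jsp}.$$

It then follows from Lemma 6 that

(57) $\sum_{j=J_2}^{J-1}\sum_{i=1}^{2^j/L}\min\left\{8\sum_{(j,k) \in B_j^i}\theta_{j,k}^2,\ 8\lambda_* Ln^{-1}\right\} \le Cn^{-2\alpha/(1+2\alpha)}(\log n)^{(2-p)/(p(1+2\alpha))}.$

On the other hand,

(58) $\sum_{j=j_0}^{J_2-1}\sum_{i=1}^{2^j/L}\min\left\{8\sum_{(j,k) \in B_j^i}\theta_{j,k}^2,\ 8\lambda_* Ln^{-1}\right\} \le \sum_{j=j_0}^{J_2-1}\sum_{i=1}^{2^j/L}8\lambda_* Ln^{-1} \le Cn^{-2\alpha/(1+2\alpha)}(\log n)^{(2-p)/(p(1+2\alpha))}.$

We finish the proof for the case $p < 2$ by putting (53), (55), (57) and (58)
together:

$$E\|\hat f_n - f\|_2^2 \le Cn^{-2\alpha/(1+2\alpha)}(\log n)^{(2-p)/(p(1+2\alpha))}.$$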



Remark 7. To make the risk of $\hat b_m$ negligible we need to have $m^{-4} =
o(n^{-2\alpha/(1+2\alpha)})$ (see Lemma 5), and to make the approximation error $\|f_J -
f\|_2^2$ negligible, we need to have $T^{-2((\alpha - 1/p) \wedge 1)} = o(n^{-2\alpha/(1+2\alpha)})$ (see Lemma
4). These constraints lead to our choice of $m = n^{1/4}$ and $T = n^{3/4}$. Then we
need $\frac{3}{2}(\alpha - \frac{1}{p}) > \frac{2\alpha}{1+2\alpha}$ or equivalently $\frac{2\alpha^2 - \alpha/3}{1+2\alpha} > \frac{1}{p}$. This last condition is
purely due to approximation error over Besov spaces.
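
The arithmetic in Remark 7 is easy to mechanize. The helper functions below are hypothetical conveniences rather than part of the procedure: one returns the bin size and number of bins for a given $n$ under the choice $m = n^{1/4}$, $T = n/m$, and the other evaluates the smoothness condition in both of its stated forms.

```python
def bin_design(n):
    """Return (m, T) with m close to n^(1/4) observations per bin and T = n // m bins."""
    m = max(1, round(n ** 0.25))
    return m, n // m

def smoothness_ok(alpha, p):
    """Evaluate (3/2)(alpha - 1/p) > 2 alpha / (1 + 2 alpha) and its rearranged form."""
    direct = 1.5 * (alpha - 1.0 / p) > 2 * alpha / (1 + 2 * alpha)
    rearranged = (2 * alpha ** 2 - alpha / 3) / (1 + 2 * alpha) > 1.0 / p
    return direct, rearranged

print(bin_design(4096))          # (8, 512)
print(smoothness_ok(1.0, 2.0))   # (True, True)
```

The two forms of the condition agree because multiplying the first by $\frac{2}{3}$ and moving $\frac{1}{p}$ to the right gives $\alpha - \frac{4\alpha}{3(1+2\alpha)} > \frac{1}{p}$, which simplifies to the second.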

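6.3. Local adaptation: Proof of Theorem 4. For simplicity, we give the
proof for Hölder classes $\Lambda^\alpha(M)$ instead of local Hölder classes $\Lambda^\alpha(M, t_0, \delta)$.
Note that for all $f \in \Lambda^\alpha(M)$, $|\theta_{j,k}| = |\langle f, \psi_{j,k}\rangle| \le C2^{-j(1/2+\alpha)}$ for some constant
$C > 0$ not depending on $f$. Note also that for any random variables
$X_i$, $i = 1, \ldots, n$, $E(\sum_{i=1}^{n}X_i)^2 \le (\sum_{i=1}^{n}(EX_i^2)^{1/2})^2$. It then follows that

$$E(\hat f_n(t_0) - f(t_0))^2$$
$$= E\Big[\sum_{k=1}^{2^{j_0}}(\hat{\tilde\theta}_{j_0,k} - \tilde\theta_{j_0,k})\phi_{j_0,k}(t_0) + \sum_{j=j_0}^{\infty}\sum_{k=1}^{2^j}(\hat\theta_{j,k} - \theta_{j,k})\psi_{j,k}(t_0) - (\hat b_m - b_m)\Big]^2$$
$$\le \Big[(E(\hat b_m - b_m)^2)^{1/2} + \sum_{k=1}^{2^{j_0}}(E(\hat{\tilde\theta}_{j_0,k} - \tilde\theta_{j_0,k})^2\phi_{j_0,k}^2(t_0))^{1/2}$$
$$\quad + \sum_{j=j_0}^{J-1}\sum_{k=1}^{2^j}(E(\hat\theta_{j,k} - \theta_{j,k})^2\psi_{j,k}^2(t_0))^{1/2} + \sum_{j=J}^{\infty}\sum_{k=1}^{2^j}|\theta_{j,k}\psi_{j,k}(t_0)|\Big]^2$$
$$= (Q_1 + Q_2 + Q_3 + Q_4)^2.$$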

Lemma 5 yields that

(59) $Q_1 = (E(\hat b_m - b_m)^2)^{1/2} = o(n^{-\alpha/(2\alpha+1)}).$

Since the wavelet $\psi$ is compactly supported, there are at most $N$ basis
functions $\psi_{j,k}$ at each resolution level $j$ that are nonvanishing at $t_0$, where $N$
is the length of the support of $\psi$. Denote $K(t_0, j) = \{k : \psi_{j,k}(t_0) \ne 0\}$. Then
$|K(t_0, j)| \le N$. It is easy to see that both $Q_2$ and $Q_4$ are small:

(60) $Q_2 = \sum_{k=1}^{2^{j_0}}(E(\hat{\tilde\theta}_{j_0,k} - \tilde\theta_{j_0,k})^2)^{1/2}|\phi_{j_0,k}(t_0)| = O(n^{-1/2})$

and

(61) $Q_4 = \sum_{j=J}^{\infty}\sum_{k=1}^{2^j}|\theta_{j,k}||\psi_{j,k}(t_0)| \le \sum_{j=J}^{\infty}N\|\psi\|_\infty 2^{j/2}C2^{-j(1/2+\alpha)} \le CT^{-\alpha}.$

We now consider the third term $Q_3$. Applying the bound (41) in Proposition
2 with $\tau < 1/(1 + 2\alpha)$ together with Lemma 4 and the bound for $\varepsilon_{j,k}$ given
in (44), we have

$$Q_3 \le \sum_{j=j_0}^{J-1}\sum_{k \in K(t_0,j)}2^{j/2}\|\psi\|_\infty(E(\hat\theta_{j,k} - \theta_{j,k})^2)^{1/2}$$

(62) $\le C\sum_{j=j_0}^{J-1}2^{j/2}[\min(2^{-j(1+2\alpha)} + T^{-2(\alpha \wedge 1)}2^{-j},\ Ln^{-1}) + n^{-2+\tau}]^{1/2}$

$$\le C(\log n/n)^{\alpha/(1+2\alpha)}.$$

Combining equations (59)-(62) we have

$$E(\hat f_n(t_0) - f(t_0))^2 \le C(\log n/n)^{2\alpha/(1+2\alpha)}.$$

6.4. Proofs of Theorems 1 and 2. Let $G(x)$ be the cumulative distribution
function of $X_{\mathrm{med}}$ and let $\varphi(z)$ and $\Phi(z)$ denote respectively the density
and cumulative distribution function of a standard normal random variable.
Using arguments similar to those in the proof of Theorem 3 in Zhou (2005) or the
sketch in Section 6 of Komlós, Major and Tusnády (1975), we need only to
show

(63) $G(x) = \Phi(\sqrt{8k}\,x)\exp(O(k|x|^3 + |x| + k^{-1/2}))$ for $-\varepsilon \le x \le 0$

and

(64) $1 - G(x) = (1 - \Phi(\sqrt{8k}\,x))\exp(O(k|x|^3 + |x| + k^{-1/2}))$ for $0 \le x \le \varepsilon$,

where $O(x)$ means a value between $-Cx$ and $Cx$ uniformly for some constant
$C > 0$. Related asymptotic expansions for the distribution of the median can be
found in the current literature, for instance, Burnashev (1996), but the major
theorems there are not sufficient to establish the median coupling inequality.
Let $H(x)$ be the distribution function of $X_1$. The density of the median $X_{(k+1)}$
is

$$g(x) = \frac{(2k+1)!}{(k!)^2}H^k(x)(1 - H(x))^k h(x).$$
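Stirling's formula, $j! = \sqrt{2\pi}\,j^{j+1/2}\exp(-j + \epsilon_j)$ with $\epsilon_j = O(1/j)$, gives

$$g(x) = \frac{(2k+1)!}{(k!)^2}H^k(x)(1 - H(x))^k h(x)$$
$$= \frac{2\sqrt{2k+1}}{e\sqrt{2\pi}}\left(\frac{2k+1}{2k}\right)^{2k+1}[4H(x)(1 - H(x))]^k h(x)\exp\left(O\left(\frac{1}{k}\right)\right).$$

It is easy to see $|\sqrt{2k+1}/\sqrt{2k} - 1| \le k^{-1}$, and

$$\left(\frac{2k+1}{2k}\right)^{2k+1} = \exp\left(-(2k+1)\log\left(1 - \frac{1}{2k+1}\right)\right) = \exp\left(1 + O\left(\frac{1}{k}\right)\right).$$

Then we have, when $0 < H(x) < 1$,

(65) $g(x) = \frac{\sqrt{8k}}{\sqrt{2\pi}}[4H(x)(1 - H(x))]^k h(x)\exp\left(O\left(\frac{1}{k}\right)\right).$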
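
The $\exp(O(1/k))$ factor in (65) does not depend on $x$: it is the ratio of the exact constant $(2k+1)!/((k!)^2 4^k)$ to the leading factor $\sqrt{8k}/\sqrt{2\pi}$. The sketch below, with hypothetical helper names and values of $k$ chosen for illustration, evaluates the logarithm of this ratio and shows that $k$ times it stays bounded.

```python
import math

def log_exact_constant(k):
    """log of (2k+1)! / ((k!)^2 4^k), the exact factor multiplying [4H(1-H)]^k h."""
    return math.lgamma(2 * k + 2) - 2 * math.lgamma(k + 1) - k * math.log(4)

def log_leading_constant(k):
    """log of sqrt(8k) / sqrt(2 pi), the leading factor in (65)."""
    return 0.5 * math.log(8 * k) - 0.5 * math.log(2 * math.pi)

scaled_gaps = {}
for k in (10, 50, 500):
    gap = log_exact_constant(k) - log_leading_constant(k)
    scaled_gaps[k] = k * gap
    print(k, gap, k * gap)
```

The scaled gap settles near $3/8$, consistent with an $O(1/k)$ correction in the exponent.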



From the Lipschitz assumption in the theorem, Taylor's expansion gives

$$4H(x)(1 - H(x)) = 1 - 4(H(x) - H(0))^2 = 1 - 4\left[\int_0^x(h(t) - h(0))\,dt + h(0)x\right]^2 = 1 - 4(h(0)x + O(x^2))^2$$

for $0 \le |x| \le \varepsilon$, that is, $\log(4H(x)(1 - H(x))) = -4h^2(0)x^2 + O(|x|^3)$ when
$|x| \le 2\varepsilon$ for some $\varepsilon > 0$. Here $\varepsilon$ is chosen small enough such that $h(x) > 0$
for $|x| \le 2\varepsilon$. The Lipschitz assumption in the theorem also implies $\frac{h(x)}{h(0)} =
1 + O(|x|) = \exp(O(|x|))$ for $|x| \le 2\varepsilon$. Thus

$$g(x) = \frac{\sqrt{8k}\,h(0)}{\sqrt{2\pi}}\exp(-8kh^2(0)x^2/2 + O(k|x|^3 + |x| + k^{-1})) \quad \text{for } |x| \le 2\varepsilon.$$

Now we approximate the distribution function of $X_{\mathrm{med}}$ by the distribution
function of a normal random variable. Without loss of generality we assume
$h(0) = 1$. We write

$$g(x) = \frac{\sqrt{8k}}{\sqrt{2\pi}}\exp(-8kx^2/2 + O(k|x|^3 + |x| + k^{-1})) \quad \text{for } |x| \le 2\varepsilon.$$



Now we use this approximation of density functions to give the desired
approximation of distribution functions. Specifically we shall show

(66) $G(x) = \int_{-\infty}^{x}g(t)\,dt \le \Phi(\sqrt{8k}\,x)\exp(Ck|x|^3 + C|x| + Ck^{-1})$

and

(67) $G(x) \ge \Phi(\sqrt{8k}\,x)\exp(-Ck|x|^3 - C|x| - Ck^{-1})$

for all $-\varepsilon \le x \le 0$ and some $C > 0$. The proof for $0 \le x \le \varepsilon$ is similar. Now
we give the proof for inequality (66). Note that

(68) $(\Phi(\sqrt{8k}\,x)\exp(-Ckx^3 - Cx + Ck^{-1}))' = \sqrt{8k}\,\varphi(\sqrt{8k}\,x)\exp(-Ckx^3 - Cx + Ck^{-1}) - \Phi(\sqrt{8k}\,x)(3Ckx^2 - C)\exp(-Ckx^3 - Cx + Ck^{-1}).$



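From Mill's ratio, inequality (33), we have $\Phi(\sqrt{8k}\,x)(-\sqrt{8k}\,x) < \varphi(\sqrt{8k}\,x)$
and hence

$$-\Phi(\sqrt{8k}\,x)(3Ckx^2)\exp(-Ckx^3 - Cx + Ck^{-1}) \ge \sqrt{8k}\,\varphi(\sqrt{8k}\,x)\left(\frac{3C}{8}x\right)\exp(-Ckx^3 - Cx + Ck^{-1}).$$

This and (68) yield

$$(\Phi(\sqrt{8k}\,x)\exp(-Ckx^3 - Cx + Ck^{-1}))'$$
$$\ge \sqrt{8k}\,\varphi(\sqrt{8k}\,x)\left(1 + \frac{3C}{8}x\right)\exp(-Ckx^3 - Cx + Ck^{-1})$$
$$\ge \sqrt{8k}\,\varphi(\sqrt{8k}\,x)\exp(Cx/2)\exp(-Ckx^3 - Cx + Ck^{-1})$$
$$\ge \sqrt{8k}\,\varphi(\sqrt{8k}\,x)\exp\left(-\frac{C}{2}kx^3 - \frac{C}{2}x + Ck^{-1}\right).$$

Here in the second inequality we apply $(1 + 3Cx/8) \ge \exp(Cx/2)$ when
$|Cx| \le C(2\varepsilon) < 1/2$. Thus we have

$$(\Phi(\sqrt{8k}\,x)\exp(-Ckx^3 - Cx + Ck^{-1}))' \ge \sqrt{8k}\,\varphi(\sqrt{8k}\,x)\exp(O(k|x|^3 + |x| + k^{-1}))$$

for $C$ sufficiently large and for $-2\varepsilon \le x \le 0$, then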



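$$\int_{-2\varepsilon}^{x}g(t)\,dt \le \int_{-2\varepsilon}^{x}(\Phi(\sqrt{8k}\,t)\exp(-Ckt^3 - Ct + Ck^{-1}))'\,dt$$
$$= \Phi(\sqrt{8k}\,x)\exp(-Ckx^3 - Cx + Ck^{-1}) - \Phi(-\sqrt{8k} \cdot (2\varepsilon))\exp(C(k(2\varepsilon)^3 + 2\varepsilon + k^{-1}))$$
$$\le \Phi(\sqrt{8k}\,x)\exp(-Ckx^3 - Cx + Ck^{-1}).$$

In (65) we see

$$\int_{-\infty}^{-2\varepsilon}g(t)\,dt = \int_{-\infty}^{-2\varepsilon}\frac{(2k+1)!}{(k!)^2}H^k(t)(1 - H(t))^k h(t)\,dt$$
$$= \int_0^{H(-2\varepsilon)}\frac{(2k+1)!}{(k!)^2}u^k(1 - u)^k\,du$$
$$= o(k^{-1})\int_{H(-3\varepsilon/2)}^{H(-\varepsilon)}\frac{(2k+1)!}{(k!)^2}u^k(1 - u)^k\,du$$
$$\le o(k^{-1})\int_{H(-2\varepsilon)}^{H(x)}\frac{(2k+1)!}{(k!)^2}u^k(1 - u)^k\,du$$
$$= o(k^{-1})\int_{-2\varepsilon}^{x}g(t)\,dt,$$

where the third equality is from the fact that $u_1^k(1 - u_1)^k = o(k^{-1})u_2^k(1 - u_2)^k$
uniformly for $u_1 \in [0, H(-2\varepsilon)]$ and $u_2 \in [H(-3\varepsilon/2), H(-\varepsilon)]$. Thus we have

$$G(x) \le \Phi(\sqrt{8k}\,x)\exp(-Ckx^3 - Cx + Ck^{-1}),$$

which is equation (66). Equation (67) can be established in a similar way.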

Remark. Note that in the proof of Theorem 1 it can be seen easily 
that constants C and e in equation (5) depend only on the ranges of h(0) 
and the bound of Lipschitz constants of h at a fixed open neighborhood of 
0. Theorem 2 then follows from the proof of Theorem 1 together with this 
observation. 